Wednesday, September 15, 2010

Re: [Geopriv] Format of SSID in held-measurements-01.txt

Martin,

On Sep 15, 2010, at 12:52 AM, Thomson, Martin wrote:
> David Waitzman writes:
>> As I expressed in my first message, users don't normally look at raw
>> XML contents:
>
> This is an interesting argument, and one that I've seen used quite frequently (I've used it myself on occasion).
>
> When applied to the general question of text-vs-binary protocols, the answer tends to be entirely subjective. Taken to its extreme, you end up with ASN.1+PER or similar solutions.
>
> I favour the protocol being usable in the default case. Even if that sort of usage seems unlikely, it's been hugely beneficial in debugging thus far.

I appreciate your argument: I used to be able to read parts of hex dumps of ASN.1 BER-encoded data from SNMP. I cede the point that having human readability can help.

>> Your suggestion in msg08717.html of: [...]
>>>>> octetsAsString = *(vcharx / escaped)
>>>>> vcharx = %x21-5b / %x5d-7e ; VCHAR minus backslash
>>>>> escaped = %x5c 2HEXDIG
>>
>> doesn't handle the UTF-8 encoding cases well.
>
> LDAP solves this by adding UTFMB, a pattern that allows for multi-byte UTF-8 sequences:
> octetsAsString = *(vcharx / escaped / UTFMB)
>
> In XML, the easiest solution is slightly different, and even quite simple:
> ([^\\]|\\[0-9a-fA-F]{2})*
>
> That is, everything other than backslash, or backslash plus two hex characters. To convert from a raw token value to a sequence of octets, UTF-8 encode all characters except backslash, then replace all backslash-escaped sequences with a single octet. In reverse, decode UTF-8 and backslash-escape anything that doesn't decode, plus backslash.
>
> The benefits are clear enough:
>
> <ssid>ManufacturerName</ssid>
>
> ...as opposed to:
>
> <ssid>4d616e7566616374757265724e616d65</ssid>
>
> ...with the occasional:
>
> <ssid>Wêird\5cNàmé</ssid>
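
A minimal sketch of the conversion described in the quoted text, in Python. The function names and the use of Python's "surrogateescape" error handler to find octets that do not decode as UTF-8 are assumptions made for illustration; nothing here comes from the draft itself.

```python
import re

# Token pattern from the quoted text: anything other than a backslash,
# or a backslash followed by exactly two hex digits.
TOKEN = re.compile(r"(?:[^\\]|\\[0-9a-fA-F]{2})*")
PIECE = re.compile(r"\\([0-9a-fA-F]{2})|([^\\]+)")


def octets_to_token(octets: bytes) -> str:
    """Decode UTF-8; backslash-escape backslash and anything that does not decode."""
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN, which makes those bytes easy to spot afterwards.
    text = octets.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\5c")
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append("\\%02x" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


def token_to_octets(token: str) -> bytes:
    """UTF-8 encode plain characters; turn each \\XX escape into one octet."""
    if TOKEN.fullmatch(token) is None:
        raise ValueError("backslash must be followed by two hex digits")
    out = bytearray()
    for m in PIECE.finditer(token):
        if m.group(1) is not None:
            out.append(int(m.group(1), 16))
        else:
            out += m.group(2).encode("utf-8")
    return bytes(out)
```

Note that a zero octet is valid UTF-8 (it decodes to U+0000), so this sketch passes it through unescaped even though XML does not allow that character.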

But there's a need to be defensive about illegal UTF-8 sequences, as well as zero-valued bytes.
See RFC 3629, Section 6, and also http://etutorials.org/Programming/secure+programming/Chapter+3.+Input+Validation/3.12+Detecting+Illegal+UTF-8+Characters/

I suggest a more defensive algorithm: hex-encode anything that is not printable ASCII. That won't preserve Wêird\5cNàmé as written, but it will handle ManufacturerName and \20\00\ff\fbUgly\00\00\00.
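
A rough sketch of that defensive rule, in Python. It assumes "printable ASCII" means %x21-7e minus backslash (so space is escaped, matching the \20 in the example above); the function names are made up for illustration.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def encode_defensive(octets: bytes) -> str:
    """Keep printable ASCII (except backslash) as-is; hex-escape every other octet."""
    out = []
    for b in octets:
        if 0x21 <= b <= 0x7E and b != 0x5C:
            out.append(chr(b))
        else:
            out.append("\\%02x" % b)
    return "".join(out)


def decode_defensive(token: str) -> bytes:
    """Reverse of encode_defensive; reject anything outside the restricted alphabet."""
    out = bytearray()
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\":
            pair = token[i + 1:i + 3]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError("backslash must be followed by two hex digits")
            out.append(int(pair, 16))
            i += 3
        elif 0x21 <= ord(ch) <= 0x7E:
            out.append(ord(ch))
            i += 1
        else:
            raise ValueError("character outside printable ASCII: %r" % ch)
    return bytes(out)
```

With this rule the token never contains anything but ASCII, so the receiver has no UTF-8 decoding to get wrong: encode_defensive(b" \x00\xff\xfbUgly\x00\x00\x00") yields \20\00\ff\fbUgly\00\00\00, and an SSID that happens to be UTF-8 text comes out with each non-ASCII byte escaped.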

--
David Waitzman
BBN Technologies
djw@bbn.com 410-290-6160
