I am an algorithm that sniffs at byte streams purporting to be XML documents to figure out which character encoding was used to produce them. I start by checking the first four bytes of the stream to assign a tentative encoding (a code sketch of this step follows the list below). If I see:
- 0xEF 0xBB 0xBF, I assign "UTF-8";
- 0xFF 0xFE or 0xFE 0xFF followed by anything but 0x00 0x00, I assign "UTF-16";
- 0x4C 0x6F 0xA7 0x94 (the bytes of "<?xm" in EBCDIC), I assign "EBCDIC-unknown";
- Otherwise, I assign "unknown".
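A minimal Python sketch of that first step, assuming the whole stream is already available as a bytes object; the function name is illustrative, not part of the original description.

```python
def tentative_encoding(data: bytes) -> str:
    """Assign a tentative encoding from the first four bytes of the stream."""
    head = data[:4]
    if head[:3] == b"\xEF\xBB\xBF":                      # UTF-8 byte order mark
        return "UTF-8"
    if head[:2] in (b"\xFF\xFE", b"\xFE\xFF") and head[2:4] != b"\x00\x00":
        return "UTF-16"                                  # UTF-16 BOM, either endianness
    if head == b"\x4C\x6F\xA7\x94":                      # "<?xm" in EBCDIC
        return "EBCDIC-unknown"
    return "unknown"
```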
If the tentative encoding is "UTF-8", I return it. Otherwise I read forward, ignoring all 0x00 bytes, until I find either a 'g' (0x67, or 0x87 on the EBCDIC code path; in an XML declaration the only 'g' is the one in "encoding") or a '>' (0x3E, or 0x6E on the EBCDIC code path).
In the former case, I sniff further for an apostrophe (0x27, or 0x7D on the EBCDIC code path) or a double quotation mark (0x22, or 0x7F on the EBCDIC code path). I then collect the encoding name following it until the next apostrophe or quotation mark, always ignoring 0x00 bytes, and return it. (On the EBCDIC code path, I need to translate it from invariant EBCDIC to ASCII first.)
In the latter case, there is no encoding declaration, and I return "UTF-16" if the tentative encoding was "UTF-16", and "UTF-8" otherwise. (On the EBCDIC code path, this is an error.)
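Here is a sketch of that forward scan, building on the tentative_encoding sketch above. The zero-stripping generator stands in for "ignoring all 0x00 bytes"; decoding the collected name with Python's cp500 codec is my stand-in for translating invariant EBCDIC to ASCII; and raising an error when the scan runs out of input is my assumption, since the original description does not say what happens then.

```python
def sniff_encoding(data: bytes) -> str:
    tentative = tentative_encoding(data)
    if tentative == "UTF-8":
        return tentative

    ebcdic = tentative == "EBCDIC-unknown"
    g, gt = (0x87, 0x6E) if ebcdic else (0x67, 0x3E)       # 'g' and '>'
    apos, quot = (0x7D, 0x7F) if ebcdic else (0x27, 0x22)  # "'" and '"'

    # Ignore 0x00 bytes so UTF-16 text can be scanned one ASCII byte at a time.
    stream = (b for b in data if b != 0x00)

    for b in stream:
        if b == g:                      # the 'g' of "encoding": a declaration follows
            break
        if b == gt:                     # '>' before any 'g': no encoding declaration
            if ebcdic:
                raise ValueError("EBCDIC document without an encoding declaration")
            return "UTF-16" if tentative == "UTF-16" else "UTF-8"
    else:
        raise ValueError("ran out of input while sniffing")   # assumed behaviour

    # Skip ahead to the quote that opens the encoding value.
    for b in stream:
        if b in (apos, quot):
            break

    # Collect the encoding name up to the closing quote.
    name = bytearray()
    for b in stream:
        if b in (apos, quot):
            break
        name.append(b)
    return name.decode("cp500" if ebcdic else "ascii")
```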
Then someone else starts over from the beginning of the byte stream, decoding and parsing. I may return erroneous results if the document is not well-formed XML, but in that case there will certainly be errors detected by the parser.
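For instance, feeding the sketches above two made-up documents, one UTF-16 and one EBCDIC, would look like this:

```python
xml = '<?xml version="1.0" encoding="UTF-16"?><doc/>'
print(sniff_encoding(b"\xFF\xFE" + xml.encode("utf-16-le")))   # -> UTF-16

ebcdic_xml = '<?xml version="1.0" encoding="IBM500"?><doc/>'
print(sniff_encoding(ebcdic_xml.encode("cp500")))              # -> IBM500
```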
1 comment:
I remember reading Mark Pilgrim's rants while he worked on the liberal feed parser. He documented it here.