Writing out XML

You can't just embed plain text into an XML element or attribute; character content and attribute values have to be escaped in several ways, not all of them obvious. Here's a checklist of things to make sure to do. (Once again, this post will look terrible in RSS readers that don't fully understand Atom.)

  1. Escape all & characters as &amp;.
  2. Escape all < characters as &lt;.
  3. Escape all > characters as &gt;. Technically it's enough to do so only when they are preceded by ]] in character content, but in my opinion making that check is more trouble than it's worth.
  4. Escape all carriage-return characters as &#xD;. These should be very rare in XML content, as they will have been converted to line-feeds on parsing.
  5. Escape all tab characters in attribute values as &#x9;. You can escape them in character content if you want, but it's not necessary.
  6. Escape all line-feed/newline characters in attribute values as &#xA; (not D as I first wrote).
  7. Output all line-feed/newline characters in character content as the local line terminator: carriage-return (on Mac Classic), line-feed (on Unix) or both (on Windows). You can provide alternative line terminators at user option.
  8. Escape all characters that can't be represented in the output character set as numeric character references. If the output character set is UTF-8 or UTF-16 (in any flavor), this step is not necessary.
  9. Directly output everything else.
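
The checklist above can be sketched in a few lines of Python. This is only an illustration, not how any particular serializer does it: the names escape_text, escape_attribute and _escape_unencodable are made up for this example, and the sketch assumes attribute values will be written between double quotes, so it also escapes " as &quot;, which the checklist does not cover. Step 8 relies on Python's built-in "xmlcharrefreplace" error handler, which emits decimal character references.

```python
import os

# Steps 1-4 apply to both character content and attribute values.
_COMMON = {
    ord("&"): "&amp;",
    ord("<"): "&lt;",
    ord(">"): "&gt;",
    ord("\r"): "&#xD;",
}


def _escape_unencodable(text, encoding):
    # Step 8: anything the output encoding can't represent becomes a
    # decimal character reference. A no-op for UTF-8 and UTF-16.
    return text.encode(encoding, "xmlcharrefreplace").decode(encoding)


def escape_text(text, encoding="utf-8", newline=os.linesep):
    """Escape character content (hypothetical helper)."""
    table = dict(_COMMON)
    # Step 7: line-feeds become the chosen line terminator. Because
    # translate() works in a single pass, a "\r" introduced here is
    # not re-escaped by the step 4 rule.
    table[ord("\n")] = newline
    return _escape_unencodable(text.translate(table), encoding)


def escape_attribute(value, encoding="utf-8"):
    """Escape an attribute value (hypothetical helper)."""
    table = dict(_COMMON)
    table[ord("\t")] = "&#x9;"   # step 5
    table[ord("\n")] = "&#xA;"   # step 6
    # Assumption beyond the checklist: the value is delimited by
    # double quotes, so a literal " must be escaped too.
    table[ord('"')] = "&quot;"
    return _escape_unencodable(value.translate(table), encoding)
```

Doing all the single-character replacements in one translate() pass avoids the classic ordering bug where the & in an already-produced &lt; gets escaped a second time.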

I'm glad to say that XOM, my favorite XML tree representation, does all these things in its Serializer class.

1 comment:

Jakub said...

In attribute value you should also escape SPACE character, see http://www.w3.org/TR/xml/#AVNormalize