You can't just embed plain text into an XML element or attribute; character content and attribute values have to be escaped in several ways, not all of them obvious. Here's a checklist of things to do. (Once again, this post will look terrible in RSS readers that don't fully understand Atom.)
- Escape all & characters as &amp;.
- Escape all < characters as &lt;.
- Escape all > characters as &gt;. Technically it's enough to do so only when they are preceded by ]] in character content, but in my opinion making that check is more trouble than it's worth.
- Escape all carriage-return characters as &#xD;. These should be very rare in XML content, as they will have been converted to line-feeds on parsing.
- Escape all tab characters in attribute values as &#x9;. You can escape them in character content if you want, but it's not necessary.
- Escape all line-feed/newline characters in attribute values as &#xA; (not &#xD; as I first wrote).
- Output all line-feed/newline characters in character content as the local line terminator: carriage-return (on Mac Classic), line-feed (on Unix) or both (on Windows). You can provide alternative line terminators at user option.
- Escape all characters that can't be represented in the output character set. If the output character set is UTF-8 or UTF-16 (in any flavor), this step is not necessary.
- Directly output everything else.
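As a rough illustration, the checklist can be sketched in Python along these lines. The function names (escape_text, escape_attribute) and the encoding and line_sep parameters are made up for this example and are not taken from any library. The attribute function also escapes double quotes as &quot;, which goes beyond the checklist; it assumes the attribute value will be written between double quotes.

```python
import os


def _char_ref(ch: str) -> str:
    """Return a hexadecimal character reference such as &#xD; for ch."""
    return "&#x{:X};".format(ord(ch))


def _encodable_or_ref(ch: str, encoding: str) -> str:
    """Pass ch through if the output encoding can represent it, else escape it."""
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        return _char_ref(ch)


def escape_text(text: str, encoding: str = "utf-8",
                line_sep: str = os.linesep) -> str:
    """Escape character content (hypothetical helper following the checklist)."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")   # always escaped; no check for a preceding ]]
        elif ch == "\r":
            out.append("&#xD;")  # a literal CR would be normalized away on parsing
        elif ch == "\n":
            out.append(line_sep)  # local line terminator, overridable by the caller
        else:
            out.append(_encodable_or_ref(ch, encoding))
    return "".join(out)


def escape_attribute(value: str, encoding: str = "utf-8") -> str:
    """Escape an attribute value (hypothetical helper following the checklist)."""
    out = []
    for ch in value:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif ch == '"':
            # Not on the checklist: assumes the value is delimited by double quotes.
            out.append("&quot;")
        elif ch in "\t\n\r":
            out.append(_char_ref(ch))  # &#x9;, &#xA;, &#xD;
        else:
            out.append(_encodable_or_ref(ch, encoding))
    return "".join(out)
```

If the result of escape_text is written to a file, the file should be opened in binary mode or with newline="" so that Python does not translate the line terminators a second time.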
I'm glad to say that XOM, my favorite XML tree representation, does all these things in its Serializer class.
1 comment:
In attribute values you should also escape the SPACE character; see http://www.w3.org/TR/xml/#AVNormalize