2008-02-07

Which characters are excluded in XML 5th Edition names?

The list of allowed name characters in the XML 1.0 Fifth Edition looks pretty miscellaneous. The clue to what's really going on is that unlike the rule of earlier XML 1.0 editions, where everything not permitted was forbidden, now everything that is not forbidden is permitted. (I emphasize that this is only about name characters: every character is and always has been permitted in running text and attribute values except the ASCII controls other than tab, line feed, and carriage return, plus the surrogates, U+FFFE, and U+FFFF.)

So what's forbidden, and why?

  • The ASCII control characters and their 8-bit counterparts. Obviously.
  • The ASCII and Latin-1 symbolic characters, with the exceptions of hyphen, period, colon, underscore, and middle dot, which have always been permitted in XML names. These characters are commonly used as syntax delimiters either in XML itself or in other languages, and so are excluded.
  • The Greek question mark (U+037E), which looks like a semicolon and is canonically equivalent to a regular semicolon.
  • The General Punctuation block of Unicode, with the exceptions of the zero-width joiner, zero-width non-joiner, undertie, and character-tie characters, which are required in certain languages to spell words correctly. Various kinds of blank spaces and assorted punctuation don't make sense in names.
  • The various Unicode symbols blocks reserved for "pattern syntax", from U+2190 to U+2BFF. These characters should never appear in identifiers of any sort, as they are reserved for use as syntactic delimiters in future languages that exploit non-ASCII syntax. Many are assigned, some are not.
  • The Ideographic Description Characters block, which is used to describe (not create) uncoded Chinese characters.
  • The surrogate code units (which don't correspond to Unicode characters anyhow) and private-use characters. Using the latter, in names or otherwise, is very bad for interoperability.
  • The Plane 0 non-characters at U+FDD0 to U+FDEF, U+FFFE, and U+FFFF. The non-characters on the other planes are allowed, not because they are a good idea, but to simplify implementation.
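
One way to check the list above is to start from what is allowed and compute what is left over. The following is a minimal Python sketch, not anything taken from the spec itself: the ranges are transcribed from the NameStartChar and NameChar productions of the Fifth Edition with adjacent ranges merged, and the names NAME_CHAR_RANGES and forbidden_ranges are made up for illustration.

```python
# Inclusive (first, last) code point ranges permitted by the NameChar
# production of XML 1.0 Fifth Edition, transcribed by hand with
# adjacent ranges merged. Treat the table as an assumption to verify
# against the spec before relying on it.
NAME_CHAR_RANGES = [
    (0x2D, 0x2E),        # hyphen, period
    (0x30, 0x3A),        # digits 0-9, colon
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # underscore
    (0x61, 0x7A),        # a-z
    (0xB7, 0xB7),        # middle dot
    (0xC0, 0xD6),
    (0xD8, 0xF6),
    (0xF8, 0x37D),       # includes combining diacritics U+0300-U+036F
    (0x37F, 0x1FFF),
    (0x200C, 0x200D),    # zero-width non-joiner, zero-width joiner
    (0x203F, 0x2040),    # undertie, character tie
    (0x2070, 0x218F),
    (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def forbidden_ranges(allowed=NAME_CHAR_RANGES, last=0x10FFFF):
    """Return the gaps between the allowed ranges, i.e. the exclusions."""
    gaps = []
    next_unseen = 0
    for lo, hi in sorted(allowed):
        if lo > next_unseen:
            gaps.append((next_unseen, lo - 1))
        next_unseen = max(next_unseen, hi + 1)
    if next_unseen <= last:
        gaps.append((next_unseen, last))
    return gaps


if __name__ == "__main__":
    for lo, hi in forbidden_ranges():
        print(f"U+{lo:04X}..U+{hi:04X}")
```

The gaps it prints line up with the bullets: a lone U+037E for the Greek question mark, U+2190 to U+2BFF for the pattern-syntax blocks, U+D800 to U+F8FF for the surrogates and Plane 0 private use, and so on.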

Note that the undertie and character tie, the European digits 0-9, and the diacritics in the Combining Diacritical Marks block (U+0300 to U+036F) are not permitted at the start of a name, any more than the hyphen, period, and middle dot ever were. Other characters could have sensibly been excluded, particularly combining characters that don't happen to be in that block, but it simplifies implementation to permit them.
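
To make the start-of-name restriction concrete, here is a rough Python sketch of a name checker built as a regular expression. The character classes are my transcription of the Fifth Edition NameStartChar and NameChar productions, and is_xml_name is a hypothetical helper, not an API from any XML library.

```python
import re

# Characters allowed at the start of a name (NameStartChar), written
# as the inside of a regex character class. Hand-transcribed from the
# Fifth Edition; an assumption to verify against the spec.
_START = (
    "A-Z_a-z:"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# Extra characters allowed only after the first position (NameChar):
# hyphen, period, digits 0-9, middle dot, combining diacritics,
# undertie and character tie.
_EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

_NAME_RE = re.compile(f"[{_START}][{_START}{_EXTRA}]*")


def is_xml_name(s: str) -> bool:
    """True if s matches the Name production as transcribed above."""
    return _NAME_RE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["a1-b.c", "1abc", "a\u203Fb", "\u203Fa", "\u17E1"]:
        print(repr(candidate), is_xml_name(candidate))
```

Under these tables a European digit or an undertie is rejected in first position but accepted later in the name, while a digit from another script, such as KHMER DIGIT ONE (U+17E1), is accepted even at the start.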

This list is intentionally sparse. The new Appendix J gives a simplified set of non-binding suggestions for choosing names that are actually sensible.

7 comments:

James Abley said...

Or to put it in simpler terms for people like me, they've switched from a whitelist (Name is anything in this set) to a blacklist (Name is anything except items in this other set), yet they've retained the same version number. What the hell are they smoking!? From a Unicode perspective it feels like a good idea, but the 1.0 version number? Come on!

John Cowan said...

Well, if you want to use XML 1.1, don't let any of us stop you. The trouble is that far too many other people don't.

James Abley said...

That's kind of my point. I do want to stick with XML 1.0; XML 1.1 doesn't appear to offer anything of particular value in the kind of work I do; but the definition of the Name production being retrospectively changed for XML 1.0 feels very wrong.

John Cowan said...

Well, to me it "feels wrong" and always has that some people can use their native language, or the official language of their country, in making up element and attribute names, and others cannot. I tried to solve the problem the right way with XML 1.1; now I'm trying to solve it the wrong way with Fifth Edition.

James Abley said...

+1 - I couldn't agree more and have always thought that. If you look at XML and character encoding issues, things can look slightly racist; maybe racist is too strong. Let's say they can exhibit the sort of thinking that I would label (in a very self-aware fashion) as being constrained by American[1] closed world thinking that isn't aware of anything outside of North America. And I'm as guilty as the next person of doing that. I've written code to the XML standard, without stopping to think "Hang on a minute, this is wrong and I should strive to do better." In a few years' time, people won't give a damn that the spec was amended in this way; all that will matter is that it does the Right Thing, and continues to do so in the future. Thanks for this conversation; I'd like to reverse and consolidate my position as supporting and applauding this change, and will send an email to the W3C list to that effect.

[1] Note that's not a slur on Canadians!

Roger S said...

I'm glad to see the coeng (U+17D2), introduced simply for electronic keying of Khmer, is permitted.

But I wonder why go to the trouble of specifically excluding Latin-1 symbols but not, say, Khmer divination symbols (U+19E0 - U+19FF) and punctuation (U+17D3 - U+17DA).

It also seems to allow a digit to begin a name, so long as it's not one of the European digits 0-9. I see, however, in Appendix J (suggestions for XML names) that Khmer digits, lacking the Unicode property ID_Start, are frowned upon.

But why specify some digits and not provide a complete list? Same logic as with the Combining Characters, to keep the list sparse?

Roger

John Cowan said...

rsperberg:

I wonder why go to the trouble of specifically excluding Latin-1 symbols but not, say, Khmer divination symbols (U+19E0 - U+19FF) and punctuation (U+17D3 - U+17DA).

Too much trouble, basically. We wanted a simple table of exclusions that would never change, and didn't want to bother cherry-picking them. Unicode already does that, and App. J tells you to follow Unicode.

But why specify some digits and not provide a complete list? Same logic as with the Combining Characters, to keep the list sparse?

Exactly. Besides, when we encode the next script, the list becomes incomplete again, and parser authors don't want to have to play catch-up.