2011-01-01

MicroXML and JSON

Warning: You don't know about MicroXML without you have read a blog post by the name of "More on MicroXML"; but that ain't no matter, because you can click on the link and read all about it.

Warning Too: Carefully note the word "and" in the title. There's a reason why it's not "versus".

The whole point of MicroXML is to provide an XML spec (and associated data model) which is small and simple enough, and easy enough to implement, that it can go where no XML has gone before. Of course, JSON is already filling part of that niche, and it's even simpler than MicroXML. So MicroXMLers have two choices: think up reasons why JSON is bad, or figure out ways to coexist with it. My personality being what it is, I choose the second.

The goals of this posting are a) to specify a way to losslessly and uniquely transform JSON documents into MicroXML documents and back, and b) to specify a way to add markup to an arbitrary MicroXML document to explain how to transform it to JSON, which probably involves some amount of loss, because if MicroXML weren't more expressive than JSON, it wouldn't have a reason to exist. Consequently, a non-goal is to specify a way to losslessly and uniquely transform MicroXML to JSON and back.

JSON values have six possible types: objects (key-value mappings), arrays (ordered lists of values), strings, numbers, booleans, and null. The simplest approach to the first goal that could possibly work is to define a MicroXML vocabulary with six elements in it, named object, array, string, number, boolean, and null, and that's what I'm going to specify. So JSON converted to MicroXML looks pretty much like JSON itself, only more verbose. Why do this at all? So that the converted JSON can be fed into a MicroXML-based or XML-based pipeline and possibly converted back to JSON at the other end. Of course, if you don't need to do that, no problem: just don't convert to MicroXML in the first place.

Five of the six types are easy to represent: an array element represents the elements of the array using its child elements; a string, number, or boolean element contains the string, number or boolean value as character content, and a null element is always empty.

Next we must choose how to represent the key-value pairs within an object. They can't be represented as attributes (that is, with the key as the attribute name and the value as the attribute value), because the JSON RFC only says that keys SHOULD be unique, not that they MUST be unique, and attribute names in XML elements MUST be unique. So we'll represent each key-value pair as a child element, and represent the value of the pair using the content of the element.
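The uniqueness mismatch is easy to see in practice. Here is a minimal sketch using Python's standard library (chosen purely for illustration; nothing in this post depends on Python): a JSON parser can be asked to keep duplicate keys, while an XML parser rejects a duplicated attribute name outright.

```python
import json
import xml.etree.ElementTree as ET

# JSON: keys only SHOULD be unique, so a document with a repeated key still parses.
# object_pairs_hook=list keeps every pair instead of collapsing them into a dict.
pairs = json.loads('{"foo": "bar", "foo": 42}', object_pairs_hook=list)
print(pairs)

# XML: attribute names MUST be unique, so this is a well-formedness error.
try:
    ET.fromstring('<object foo="bar" foo="42"/>')
    duplicate_attribute_ok = True
except ET.ParseError:
    duplicate_attribute_ok = False
print(duplicate_attribute_ok)
```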

But what about the key? There are two plausible choices: use an element with the fixed name pair and specify the key (which must be a string) using a key attribute, or use the name of the element directly as the key. The first solution is general but verbose; the second solution is not general, because only a subset of strings can appear as a MicroXML (or XML) element name. We'll require MicroXML-to-JSON converters to accept both (be liberal in what you accept), but require JSON-to-MicroXML converters to use the second solution unless the key contains a character that's not valid in XML names (be conservative in what you send). So pair becomes a seventh name in the MicroXML vocabulary for JSON.
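The JSON-to-MicroXML direction can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the names JsonObject, to_element, and json_to_microxml are made up for the example, and SAFE_NAME is a deliberately conservative ASCII-only stand-in for "valid in XML names" (real names allow many more characters).

```python
import json
import re
import xml.etree.ElementTree as ET

# Assumption: a conservative ASCII-only approximation of a legal element name.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")


class JsonObject(list):
    """A JSON object kept as a list of (key, value) pairs, so duplicate keys survive."""


def to_element(value):
    """Convert a parsed JSON value into an element of the seven-name vocabulary."""
    if isinstance(value, JsonObject):
        elem = ET.Element("object")
        for key, member in value:
            if SAFE_NAME.match(key):
                pair = ET.SubElement(elem, key)  # key used directly as the element name
            else:
                pair = ET.SubElement(elem, "pair", {"key": key})  # general fallback
            pair.append(to_element(member))
        return elem
    if isinstance(value, list):
        elem = ET.Element("array")
        for item in value:
            elem.append(to_element(item))
        return elem
    if value is None:
        return ET.Element("null")  # always empty
    # bool is tested before numbers because in Python bool is a subclass of int.
    if isinstance(value, bool):
        elem = ET.Element("boolean")
        elem.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        elem = ET.Element("number")
        elem.text = json.dumps(value)
    else:
        elem = ET.Element("string")
        elem.text = value
    return elem


def json_to_microxml(text):
    """Parse JSON text (keeping duplicate keys) and serialize it as markup."""
    value = json.loads(text, object_pairs_hook=JsonObject)
    return ET.tostring(to_element(value), encoding="unicode")
```

For instance, json_to_microxml('{"a": 1}') yields <object><a><number>1</number></a></object>, while a key such as "full name" falls back to a pair element with a key attribute.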

(The characters U+FFFE and U+FFFF can appear literally in a JSON string, key, or value, but can't appear in XML character content, not even using character references. These aren't likely to actually occur in JSON documents, but just for completeness we'll say that they must be escaped with JSON escaping as \uFFFE and \uFFFF. This constitutes a minor violation of the rule of verbatim round-tripping, since JSON->MicroXML->JSON will always produce escape sequences for these characters even if the original document had them appear literally, but no realistic JSON application will notice the difference.)

So much for the first goal. What about the second? We'll require MicroXML->JSON translators to adopt the rules above to begin with. What about elements and attributes present in the MicroXML that have other names? We'll say that if an element has the attribute json-type, then the value of that attribute tells us how to process it. Thus an element named list with a json-type attribute of array will be converted to a JSON array. In this process, the actual name of the element and any inappropriate content are discarded, including any character content in an element with a json-type of object or array and any content at all of an element with a json-type of null. We don't discard child elements in elements with a json-type of string, number, or boolean: instead we use the XPath string-value of the element, which is the same as the content of the element with any tags ignored.

What about MicroXML attributes? We discard them for all elements except those with a json-type of object, where we treat them as additional key-value pairs (excepting of course any json-key attribute).
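A rough sketch of the MicroXML-to-JSON direction, again in Python and again only one possible reading of the rules. Several details are assumptions of the sketch rather than anything specified here: json-key is taken to override the element name as the key of an annotated child, attribute values become JSON strings, a repeated key keeps its last value (a Python dict can't hold duplicates), and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

JSON_TYPES = {"object", "array", "string", "number", "boolean", "null"}


def json_type(elem):
    """The json-type attribute wins; otherwise fall back to the element name."""
    kind = elem.get("json-type", elem.tag)
    if kind not in JSON_TYPES:
        raise ValueError(f"cannot tell how to convert <{elem.tag}>")
    return kind


def pair_of(child):
    """Work out the (key, value) pair that a child of an object stands for."""
    if "json-type" in child.attrib:
        # Assumption: an annotated child is its own value; json-key may rename it.
        return child.get("json-key", child.tag), to_json(child)
    # Otherwise it is a pair wrapper: <pair key="..."> or an element named by the key.
    key = child.get("key", child.tag) if child.tag == "pair" else child.tag
    if len(child) != 1:
        raise ValueError(f"<{child.tag}> should wrap exactly one value element")
    return key, to_json(child[0])


def to_json(elem):
    """Convert an element to a Python value that json.dumps can serialize."""
    kind = json_type(elem)
    if kind == "null":
        return None  # any content at all is discarded
    if kind == "array":
        return [to_json(child) for child in elem]  # character content is discarded
    if kind == "object":
        # Attributes become extra pairs (assumption: as strings), minus the markup ones.
        result = {k: v for k, v in elem.attrib.items()
                  if k not in ("json-type", "json-key")}
        for child in elem:
            key, value = pair_of(child)
            result[key] = value  # assumption: the last duplicate wins
        return result
    text = "".join(elem.itertext())  # string-value: content with tags ignored
    if kind == "string":
        return text
    if kind == "boolean":
        return text.strip() == "true"
    try:
        return int(text)
    except ValueError:
        return float(text)
```

With this reading, an element such as <person json-type="object" id="42"> whose child is <name json-type="string">Huck <b>Finn</b></name> converts to an object with the keys id and name, the latter holding the string "Huck Finn".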

As usual, comments are solicited.

10 comments:

Keith Gaughan said...

"Next we must choose how to represent the key-value pairs within an object. They can't be represented as attributes, because the JSON RFC only says that keys SHOULD be unique, not that they MUST be unique."

This, at least at first glance, reads a bit like a non-sequitur: how does representing a key as an attribute force keys to be unique? This markup fragment seems perfectly acceptable to me:

<dict>
<string key="foo">bar</string>
<int key="foo">42</int>
<list key="baz">
<string>fred</string>
<string>barney</string>
</list>
</dict>

Am I misunderstanding you?

John Cowan said...

Ah. Yes, you are. I meant that we can't represent the key-value pairs as attributes; that is, using the attribute name for the key and the attribute value for the value.

I've clarified the text.

Keith Gaughan said...

Ah, that's clearer alright! I don't know if you remember it, but you seem to be creating a subset of WDDX, which was an attempt half an age back to create something JSON-like, before JSON existed, in XML.

John Cowan said...

Keith left a further comment, but Blogger seems to have dropped it on the floor, luckily after it sent me an email copy. It read:

"Ah, that's clearer alright! I don't know if you remember it, but you seem to be creating a subset of WDDX, which was an attempt half an age back to create something JSON-like, before JSON existed, in XML."

I had never heard of WDDX (different link) before, but it is scary close to what I'm describing, with the addition of a date-time (ISO 8601) simple type and a "recordset" aggregate type, in JSON terms an array of objects constrained to hold simple types only. If JSON had been designed from scratch without reference to JavaScript syntax, I'd guess it would have had date-time literals as well.

Stephen D Green said...

It interests me that one of your conclusions is to discard MicroXML attributes apart from a special 'json-type' attribute.
Firstly this suggests that there would be value if either a) there were yet a further subset of MicroXML which excluded attributes apart from certain special attributes or b) that maybe MicroXML itself should exclude attributes apart from certain special attributes (one might continue to be 'xmlns'). Your suggestion seems to favour, rather than a modified MicroXML, a new vocabulary but I wonder if a new profile (NanoXML?) or adjustment to the MicroXML profile would better meet the need of being able to map to and from JSON.
Secondly there is the matter of those special attributes of which 'json-type' might be one. Should the 'xmlns' attribute be another? How best could a new special attribute like 'json-type' be included? I guess if you opt for a vocabulary over MicroXML then it would just be an attribute in the vocabulary but if the profile itself (Micro or NanoXML) is preferred then you might want it to include some useful new reserved attributes if it could get sanction at that level. Maybe something like 'xsi:type' is in order but perhaps profiled for MicroXML or NanoXML to avoid the use of the prefix - perhaps like the reserved 'xml-' in 'xmlns' (say 'xmljt' as a shortened 'xmljsontype'). Or do I misunderstand?

Stephen D Green said...

One more comment: If there is a subset of MicroXML (whatever you call it but MicroXML without attributes except certain special ones) which can be mapped (with help from those special attributes) to JSON, is there also a subset or potential profile to identify (and name?) in JSON which best supports a roundtrip mapping? If the former is dubbed something like NanoXML, maybe the latter could be dubbed NanoJSON and the profiles of NanoXML and NanoJSON include the essential details for mapping them (including the special attribute(s)).

John Cowan said...

MicroXML without attributes (in either your (a) or (b) style) would have somewhat less expressive power than JSON, because unlike JSON it would not be able to express types easily, so I see no point in having it. Converting MicroXML to JSON would inherently be a down-level conversion. I can't say I see any need for MicroJSON either: JSON is already small enough.

I thought of xsi:type rather than json-type, but MicroXML currently doesn't allow prefixed attributes (I think it should, with or without namespace declarations). What's more, xsi:nil="true" would be the natural mapping of JSON null, but now we have two different attributes for types.

Stephen D Green said...

How about a set of special attributes (which do not have values) with reserved names (e.g. starting with 'xml', if that were sanctionable) which, when added to an element, declare for it a type (which maps to a JSON type):

object: <n xmljto>v</n>
array: <n xmljta>"v1",'v2 v3'</n>
sequence: <n xmljtq>"v1",'v2 v3'</n>
string: <n xmljts>v</n>
number: <n xmljtn>v</n>
boolean: <n xmljtb>v</n>
null: <n xmljtl/> <n xmljtl></n>

I think this would go some way to solving both of your goals, wouldn't it, but without requiring a special vocabulary - and that way the XML can have any vocabulary, except that attributes pose a problem.

For attributes there could be a convention like starting the value of the attribute with the reserved name followed by some special character (say a colon):

object: <a n="xmljto:v"/>
array: <a n="xmljta:'v1','v2 v3'"/>
sequence: <a n="xmljtq:'v1','v2 v3'"/>
string: <a n="xmljts:v"/>
number: <a n="xmljtn:v"/>
boolean: <a n="xmljtb:v"/>
null: <a n="xmljtl:"/>

or something like that.

John Cowan said...

Attributes without values are an SGML/HTML thing (really, the attribute name is the same as the attribute value), but not part of XML. I really don't want a MicroXML that isn't a subset of XML.

Stephen D Green said...

Apologies, getting forgetful there.

I guess it has to be an attribute with a codelist of types then. Maybe reserved attributes could be specified in the MicroXML profile: xmlns, xmltype (with some codelist specified like xmltype="json:object|json:array|json:sequence|json:number..." perhaps including other types too like xsd:normalizedString). These attributes would be sprinkled in with a custom vocabulary but there is a downside of the attribute not being recognised in that vocabulary. An alternative would be type-declaring values added as reserved comments to the elements which would then avoid making the XML invalid. , or , etc.
Besides this there could be defaulting rules like:
* every element with untyped text content is assumed to be type 'string';
* every element with mixed content with untyped text content is assumed to be a combination of type json:object and a reserved named JSON 'string' item (perhaps named 'xmlstring');
* every attribute with untyped content is assumed to be type 'string'.