Recycled Knowledge: 2011/01

2011-01-25

MicroXML Editor's Draft

I have assembled a very preliminary Editor's Draft for MicroXML. It is about 20 pages long, and covers for MicroXML what is covered by XML 1.0 (Fifth Edition), Namespaces in XML 1.0 (Third Edition), XML Infoset (Second Edition), xml:id 1.0, and XML Base (Second Edition), which together total about 82 pages.

It still needs formal normative references, and there are a few open issues listed in section 10. Comments and corrections are earnestly solicited either in comments here or in email.

2011-01-23

The Hoose at Pooh's Neuk / Kidnappit

Well, I've finished the second Pooh book in Scots, and it seems to me just a hair less good than the original. I have no idea if that's the translator, the author, or just me. Definitely it's harder to understand, perhaps more Scots and less English, though not so much you'd notice it right off. I do, however, want to present one of Pooh's hums in three versions: Robertson's translation, a literal back-translation by me, and Milne's original. It comes from Chapter VIII, "Where-intil Wee Grumphie does a Gey Graund Thing":

I lay on ma chist
   I lay on my chest
      I lay on my chest
And thocht I wid jist
   And thought I would just
      And I thought it best
Pit on I was haein a sleep I had missed;
   Put on I was having a sleep I had missed;
      To pretend I was having an evening rest;
I lay on ma wame
   I lay on my belly
      I lay on my tum
Some verse tae declaim
   Some verse to declaim
      And I tried to hum
But naethin particular seemed to strike hame.
   But nothing particular seemed to strike home.
      But nothing particular seemed to come.
Ma face was flat
   My face was flat
      My face was flat
On the flair, and that
   On the floor, and that
      On the floor, and that
Is aw guid and weel for an acrobat;
   Is all good and well for an acrobat;
      Is all very well for an acrobat;
But it doesna seem fair
   But it does not seem fair
      But it doesn't seem fair
Tae a Freendly Bear
   To a Friendly Bear
      To a Friendly Bear
Tae streek him oot unner an auld creel-chair.
   To stretch him out under an old wicker-chair.
      To stiffen him out with a basket-chair.
And a kind o squoot
   And a kind of squoot
      And a sort of sqoze
He could dae wioot
   He could do without
      Which grows and grows
Is no that braw for his puir auld snoot;
   Is not that pleasant for his poor old snoot;
      Is not too nice for his poor old nose,
And a kind o squeed
   And a kind of squeed
      And a sort of squch
Is sair indeed
   Is grievous indeed
      Is much too much
On his mooth and his lugs and the back o his heid.
   On his mouth and his ears and the back of his head.
      For his neck and his mouth and his ears and such.

Plainly some of the changes are enforced by the rhyme, but I do think "I lay on ma wame / Some verse tae declaim / But naethin particular seemed to strike hame" is better, and better poetry, than "I lay on my tum / And I tried to hum / But nothing particular seemed to come."

I'll also mention here Kidnappit, a graphic novel based on Robert Louis Stevenson's classic romance Kidnapped, which is available both in English and in Scots. In the original book, much of the dialogue is Scots-and-water, whereas all the narration, though in first person by a Scotsman, is in standard English or nearly so. I don't know how the characters speak in the English comic, but some lines of dialogue can be usefully contrasted between the two versions that I have read: when our hero David Balfour confronts the Red Fox in Stevenson's original (chapter 17), he says: "I am neither of his people [the Stewarts] nor yours [the Campbells], but an honest subject of King George, owing no man and fearing no man." But in the Scots comic, what David says to the Reid Tod is "I am nae aucht of James's folk or o yours. I am a leal subject o King George, aucht a nae man and feart o nane." Somehow I think that's more likely to be what the real David Balfour (so to speak) actually said.

2011-01-21

The MicroLark parser

I've been developing a parser for MicroXML which I have dubbed MicroLark, in honor of Tim Bray's original 1998 XML parser Lark. I didn't take any code from Lark, but we ended up converging on similar ideas: it provides both push and tree parsers (as well as a pull parser), it is written in Java, and I intend to evolve it as MicroXML evolves. However, because MicroXML is much smaller than XML, MicroLark is about a third the size of Lark.

If you want to cut to the chase now, you can get the jar file, the source code of version 0.8, the Javadoc, and the test cases (which are based on the W3C's XML Conformance Test Suites). When you run the jar file, pass it one argument, the name of the file you want to parse, and you'll get the parsed output in Pyx format, a simple line-oriented format similar to SGML ESIS format. If you specify @ as the file, MicroLark will read the names of files from the standard input, and write its output to files with the same names but with .pyx added.
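
For readers who haven't met Pyx, each output line carries one parse event, identified by its first character. The following is a minimal sketch of such a writer, assuming the usual Pyx conventions ("(" for a start-tag, "A" for an attribute, "-" for character data, ")" for an end-tag). The class and method names are hypothetical and are not MicroLark's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MicroLark: builds Pyx lines from parse events.
final class PyxSketch {
    private final StringBuilder out = new StringBuilder();

    // "(name" opens an element; each attribute follows on its own "A" line.
    void startTag(String name, Map<String, String> attributes) {
        out.append('(').append(name).append('\n');
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            out.append('A').append(a.getKey()).append(' ')
               .append(escape(a.getValue())).append('\n');
        }
    }

    // "-text" carries character data.
    void text(String data) {
        out.append('-').append(escape(data)).append('\n');
    }

    // ")name" closes an element.
    void endTag(String name) {
        out.append(')').append(name).append('\n');
    }

    // The format is line-oriented, so backslashes, newlines, and tabs are escaped.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t");
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        PyxSketch pyx = new PyxSketch();
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("id", "g1");
        pyx.startTag("greeting", attrs);
        pyx.text("hello\nworld");
        pyx.endTag("greeting");
        System.out.print(pyx.result());
    }
}
```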

MicroLark 0.8 supports MicroXML according to James's most recent definition, with the addition of prefixed attribute names. MicroLark allows, but does not require, namespace definitions for these prefixes. Element names do not allow prefixes. This version provides the parser proper, the Element class, and a MicroXML writer. I'll be adding more MicroXML-specific test cases in a later release. Future versions will supply a package of iterators to allow XPath-style operations, and a validator based on MicroRNG.

The core of MicroLark is a pull parser implemented as a state machine, which is just a big switch of switches. The outer switch is controlled by the current state, and the inner switches by the current character. The parser returns when a start-tag, end-tag, character data, end of document, or error is found, making sure that parsing can be resumed smoothly afterwards. Consequently, it is not draconian, though its error recovery strategy mostly consists of resetting the current state to "character data". When the parser returns, the caller can call accessors to get the current element stack or character data, or the location and text of an error. The push parser (SAX-style) and the tree parser (DOM-style) are thin layers over the pull parser; the tree parser is draconian.
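
To make the "switch of switches" shape concrete, here is a deliberately tiny sketch in the same style. It is not MicroLark's code: it handles only start-tags, end-tags, and character data (no attributes, comments, or character references), and every name in it is hypothetical. Its error recovery mimics the strategy described above by resetting the state to character data.

```java
// Hypothetical sketch, not MicroLark: a resumable pull parser for <name>, </name>, and text.
final class TinyPullParser {
    enum Event { START_TAG, END_TAG, TEXT, END_DOCUMENT, ERROR }
    private enum State { TEXT, TAG_OPEN, START_NAME, END_NAME }

    private final String input;
    private int pos = 0;
    private State state = State.TEXT;
    private final StringBuilder buf = new StringBuilder();
    private String value = "";   // name, character data, or error text of the last event

    TinyPullParser(String input) { this.input = input; }

    String value() { return value; }

    // Runs until there is something to report; the saved state lets the
    // next call resume exactly where this one stopped.
    Event next() {
        while (pos < input.length()) {
            char c = input.charAt(pos++);
            switch (state) {                      // outer switch: current state
            case TEXT:
                switch (c) {                      // inner switch: current character
                case '<':
                    state = State.TAG_OPEN;
                    if (buf.length() > 0) return emit(Event.TEXT);
                    break;
                default:
                    buf.append(c);
                }
                break;
            case TAG_OPEN:
                switch (c) {
                case '/': state = State.END_NAME; break;
                case '<': case '>': return error("empty tag name");
                default: buf.append(c); state = State.START_NAME;
                }
                break;
            case START_NAME:
                switch (c) {
                case '>': state = State.TEXT; return emit(Event.START_TAG);
                case '<': return error("'<' inside start-tag");
                default: buf.append(c);
                }
                break;
            case END_NAME:
                switch (c) {
                case '>':
                    state = State.TEXT;
                    return buf.length() == 0 ? error("empty tag name") : emit(Event.END_TAG);
                case '<': return error("'<' inside end-tag");
                default: buf.append(c);
                }
                break;
            }
        }
        if (state != State.TEXT) return error("unexpected end of input");
        if (buf.length() > 0) return emit(Event.TEXT);
        return Event.END_DOCUMENT;
    }

    private Event emit(Event e) {
        value = buf.toString();
        buf.setLength(0);
        return e;
    }

    // Recovery: drop the partial token and go back to the character-data state.
    private Event error(String message) {
        buf.setLength(0);
        state = State.TEXT;
        value = message;
        return Event.ERROR;
    }

    public static void main(String[] args) {
        TinyPullParser p = new TinyPullParser("<a>hi</a>");
        for (Event e = p.next(); e != Event.END_DOCUMENT; e = p.next()) {
            System.out.println(e + " " + p.value());
        }
    }
}
```

A push or tree layer over something like this would just be a loop around next() that dispatches on the returned event.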

The Element class is the incarnation of the MicroXML data model, so it provides access to the name, attributes, and children of an element. Element objects are provided by all three parsers, though only the tree parser populates the children. There are the usual collection methods for fetching, searching, and mutating the attributes and children; text children are represented as Strings, and when an attempt is made to insert a text child next to an existing text child, the two are coalesced. You can create your own Element objects, and the class ensures that they always represent well-formed MicroXML (for example, the names of elements and attributes must be well-formed). There are convenience methods for retrieving the current value of an inherited attribute, and for obtaining the current values of xml:lang/lang, xml:id/id, xml:base, and the namespace of a prefixed attribute.
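
The coalescing rule for text children can be illustrated with a small sketch. This is not MicroLark's Element class: the names are hypothetical, only appending is shown (not insertion at arbitrary positions), and the name check is a placeholder for real well-formedness checking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the data-model idea; MicroLark's real Element class differs.
final class ElementSketch {
    private final String name;
    // Each child is either a String (text) or an ElementSketch.
    private final List<Object> children = new ArrayList<>();

    ElementSketch(String name) {
        // Placeholder check only; a real implementation would validate the full name syntax.
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("element name must not be empty");
        }
        this.name = name;
    }

    String name() {
        return name;
    }

    // Appending text next to an existing text child merges the two into one String.
    void addText(String text) {
        int last = children.size() - 1;
        if (last >= 0 && children.get(last) instanceof String) {
            children.set(last, (String) children.get(last) + text);
        } else {
            children.add(text);
        }
    }

    void addElement(ElementSketch child) {
        children.add(child);
    }

    List<Object> children() {
        return Collections.unmodifiableList(children);
    }

    public static void main(String[] args) {
        ElementSketch p = new ElementSketch("p");
        p.addText("Hello, ");
        p.addText("world");          // merged with the previous text child
        p.addElement(new ElementSketch("br"));
        p.addText("again");          // a new text child, since an element intervenes
        System.out.println(p.children().size() + " children, first: " + p.children().get(0));
    }
}
```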

Users can create subclasses of Element and instruct the parser to use them by creating a factory object. Factory objects get the current element stack from the parser as well as the name of the new element, and return an instance based on them. This allows tree nodes to have their own fields and methods suitable for their use in the application, as well as the creation of tree nodes that enforce restrictions on their children such as "no text children" or "element children must belong to class X".
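
A rough sketch of the factory idea follows. The interface, the class names, and the stand-in for the parser are all hypothetical, chosen only to show a factory receiving the element name plus the stack of open elements and returning an application-specific subclass.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; MicroLark's actual factory interface may look quite different.
final class FactorySketch {
    // Stand-in for the generic element class.
    static class Node {
        final String name;
        Node(String name) { this.name = name; }
    }

    // An application-specific subclass carrying its own field.
    static class Heading extends Node {
        final int depth;
        Heading(String name, int depth) { super(name); this.depth = depth; }
    }

    // The factory sees the new element's name and the currently open elements.
    interface NodeFactory {
        Node create(String name, Deque<Node> openElements);
    }

    // Stand-in for the parser: opens nested elements in order, asking the factory for each.
    static Deque<Node> open(NodeFactory factory, String... names) {
        Deque<Node> stack = new ArrayDeque<>();
        for (String n : names) {
            stack.push(factory.create(n, stack));
        }
        return stack;
    }

    public static void main(String[] args) {
        NodeFactory factory = (name, stack) ->
            name.equals("h") ? new Heading(name, stack.size()) : new Node(name);
        Deque<Node> stack = open(factory, "doc", "section", "h");
        Node top = stack.peek();
        System.out.println(top.getClass().getSimpleName() + " " + top.name);
    }
}
```

A factory written this way could equally refuse to create a node (for example by throwing) when the open-element stack shows that the new element is not allowed in that position.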

MicroLark is open source, licensed under the Apache 2.0 license. Check it out, play with it, report bugs and suggestions for improvement in the comments here or to cowan@ccil.org

2011-01-18

Winnie-the-Pooh in Scots

I have just read, and with great enjoyment, Winnie-the-Pooh in Scots, translated by James Robertson and with the original E. H. Shepard illustrations.

Lang lang syne, a lang while syne noo, aboot Friday past, Winnie-the-Pooh steyed in a forest aw by himsel unner the name o Sanders.
('Whit does "unner the name" mean?' spiers Christopher Robin.
'It means he had the name in gowd letters ower the door and he steyed unner it.'
'Winnie-the-Pooh wisna jist shair,' says Christopher Robin.
'I am noo,' says a gurly [growly] voice.
'Then I'll cairry on,' says I.)

The other characters, I should mention, are Wee Grumphie (Piglet), Heehaw (Eeyore, and a much more sensible spelling than "Eeyore" to an American like me), Rabbit, Hoolet (Owl, cf. English howlet, a mixture of howl and owlet, or their French originals), Kanga, and the Bairn Roo. Teeger doesn't come in until The Hoose at Pooh's Neuk, which has just arrived but I haven't read yet.

The book is a delight overall, especially to one as steeped in Pooh as I have always been, and who had Scott, RLS, and Ian Maclaren mixed in with his English reading from a very early age. I suppose there were half-a-dozen or so words that I didn't know the meanings of, as I unfortunately read the book without the marvelous Dictionar o the Scots Leid, the OED for Scots, at hand. I must particularly praise the verse translations, such as this one from Chapter 1, "Where-intil we are introduced tae Winnie-the-Pooh and a wheen Bees, and the Stories stert":

It's a thocht, is it no, that if Bears were Bees,
They'd bigg their bykes at the bottom o trees.
And if that wis the wey o it (if Bees were Bears)
We widna hae tae speel up aw thir stairs.

My favorite moment, however, was at this line in Chapter 4: "These notices had been written by Christopher Robin, who was the only one in the forest who could spell; for Owl, wise though he was in many ways, able to read and write and spell his own name WOL, yet somehow went all to pieces over delicate words like MEASLES and BUTTEREDTOAST." In Robertson's translation, this becomes: "Thae notices had been written oot by Christopher Robin, the ainly yin in the Forest that could spell; for, lang-heidit though Hoolet wis in mony weys, able tae read and write and spell his ain name HOOTEL, for some reason he gaed aw through-ither when it cam to kittle words like MEASLES and BUTTERYBREID."

The next time I'm writing some software that needs a fanciful name, perhaps I'll call it Hootel.

2011-01-01

MicroXML and JSON

Warning: You don't know about MicroXML without you have read a blog post by the name of "More on MicroXML"; but that ain't no matter, because you can click on the link and read all about it.

Warning Too: Carefully note the word "and" in the title. There's a reason why it's not "versus".

The whole point of MicroXML is to provide an XML spec (and associated data model) which is small and simple enough, and easy enough to implement, that it can go where no XML has gone before. Of course, JSON is already filling part of that niche, and it's even simpler than MicroXML. So MicroXMLers have two choices: think up reasons why JSON is bad, or figure out ways to coexist with it. My personality being what it is, I choose the second.

The goals of this posting are a) to specify a way to losslessly and uniquely transform JSON documents into MicroXML documents and back, and b) to specify a way to add markup to an arbitrary MicroXML document to explain how to transform it to JSON, which probably involves some amount of loss, because if MicroXML weren't more expressive than JSON, it wouldn't have a reason to exist. Consequently, a non-goal is to specify a way to losslessly and uniquely transform MicroXML to JSON and back.

JSON values have six possible types: objects (key-value mappings), arrays (ordered lists of values), strings, numbers, booleans, and null. The simplest approach to the first goal that could possibly work is to define a MicroXML vocabulary with six elements in it, named object, array, string, number, boolean, and null, and that's what I'm going to specify. So JSON converted to MicroXML looks pretty much like JSON itself, only more verbose. Why do this at all? So that the converted JSON can be fed into a MicroXML-based or XML-based pipeline and possibly converted back to JSON at the other end. Of course, if you don't need to do that, no problem: just don't convert to MicroXML in the first place.

Five of the six types are easy to represent: an array element represents the elements of the array using its child elements; a string, number, or boolean element contains the string, number or boolean value as character content, and a null element is always empty.

Next we must choose how to represent the key-value pairs within an object. They can't be represented as attributes (that is, with the key as the attribute name and the value as the attribute value), because the JSON RFC only says that keys SHOULD be unique, not that they MUST be unique, and attribute names in XML elements MUST be unique. So we'll represent each key-value pair as a child element, and represent the value of the pair using the content of the element.

But what about the key? There are two plausible choices: use an element with the fixed name pair and specify the key (which must be a string) using a key attribute, or use the name of the element directly as the key. The first solution is general but verbose; the second solution is not general, because only a subset of strings can appear as a MicroXML (or XML) element name. We'll require MicroXML-to-JSON converters to accept both (be liberal in what you accept), but require JSON-to-MicroXML converters to use the second solution unless the key contains a character that's not valid in XML names (be conservative in what you send). So pair becomes a seventh name in the MicroXML vocabulary for JSON.
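
Here is a minimal sketch of the JSON-to-MicroXML direction under some loudly stated assumptions. The JSON value is taken as already parsed into plain Java objects (Map, List, String, Number, Boolean, null), so duplicate keys cannot be represented. The test for "valid as an element name" is a conservative ASCII-only approximation rather than the real XML name rule. Characters that cannot appear in XML at all are not handled. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the mapping described above; not a reference implementation.
final class JsonToMicroXmlSketch {
    static String convert(Object value) {
        StringBuilder sb = new StringBuilder();
        write(value, sb);
        return sb.toString();
    }

    private static void write(Object v, StringBuilder sb) {
        if (v == null) {
            sb.append("<null/>");                       // a null element is always empty
        } else if (v instanceof String) {
            sb.append("<string>").append(escape((String) v)).append("</string>");
        } else if (v instanceof Number) {
            sb.append("<number>").append(v).append("</number>");
        } else if (v instanceof Boolean) {
            sb.append("<boolean>").append(v).append("</boolean>");
        } else if (v instanceof List) {
            sb.append("<array>");
            for (Object item : (List<?>) v) write(item, sb);
            sb.append("</array>");
        } else if (v instanceof Map) {
            sb.append("<object>");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                String key = (String) e.getKey();
                if (isSimpleName(key)) {                // key used directly as the element name
                    sb.append('<').append(key).append('>');
                    write(e.getValue(), sb);
                    sb.append("</").append(key).append('>');
                } else {                                // fall back to <pair key="...">
                    sb.append("<pair key=\"").append(escape(key)).append("\">");
                    write(e.getValue(), sb);
                    sb.append("</pair>");
                }
            }
            sb.append("</object>");
        } else {
            throw new IllegalArgumentException("not a JSON value: " + v.getClass());
        }
    }

    // Assumption: ASCII-only approximation of an XML name; the real rule admits more.
    static boolean isSimpleName(String key) {
        return key.matches("[A-Za-z_][A-Za-z0-9_.-]*");
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("name", "A&B");
        person.put("first name", 1);
        person.put("tags", Arrays.asList(true, null));
        System.out.println(convert(person));
    }
}
```

The sketch does nothing special for keys that happen to coincide with one of the vocabulary names, a case the rules above do not address.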

(The characters U+FFFE and U+FFFF can appear literally in a JSON string, key, or value, but can't appear in XML character content, not even using character references. These aren't likely to actually occur in JSON documents, but just for completeness we'll say that they must be escaped with JSON escaping as \uFFFE and \uFFFF. This constitutes a minor violation of the rule of verbatim round-tripping, since JSON->MicroXML->JSON will always produce escape sequences for these characters even if the original document had them appear literally, but no realistic JSON application will notice the difference.)

So much for the first goal. What about the second? We'll require MicroXML->JSON translators to adopt the rules above to begin with. What about elements and attributes present in the MicroXML that have other names? We'll say that if an element has the attribute json-type, then the value of that attribute tells us how to process it. Thus an element named list with a json-type attribute of array will be converted to a JSON array. In this process, the actual name of the element and any inappropriate content is discarded, including any character content in an element with a json-type of object or array and any content at all of an element with a json-type of null. We don't discard child elements in elements with a json-type of string, number, or boolean: instead we use the XPath value of the element, which is the same as the content of the element with any tags ignored.
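
A partial sketch of the json-type rule follows, covering only the array, string, number, boolean, and null cases; objects are deliberately left out. The El class is a hypothetical stand-in for a MicroXML element, and the sketch assumes that an element without a json-type attribute falls back to its own name, per the first set of rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; handles every json-type except "object".
final class MicroXmlToJsonSketch {
    // Minimal stand-in for a MicroXML element; each child is a String or an El.
    static final class El {
        final String name;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>();
        El(String name) { this.name = name; }
        El attr(String k, String v) { attrs.put(k, v); return this; }
        El add(Object child) { children.add(child); return this; }
    }

    static String toJson(El e) {
        // json-type wins; otherwise the element name itself is taken as the type.
        String type = e.attrs.getOrDefault("json-type", e.name);
        switch (type) {
        case "null":
            return "null";                              // any content is discarded
        case "string":
            return quote(textOf(e));
        case "number":
        case "boolean":
            return textOf(e).trim();
        case "array": {
            StringBuilder sb = new StringBuilder("[");
            for (Object c : e.children) {
                if (!(c instanceof El)) continue;       // character content is discarded
                if (sb.length() > 1) sb.append(',');
                sb.append(toJson((El) c));
            }
            return sb.append(']').toString();
        }
        default:
            throw new IllegalArgumentException("no JSON mapping for <" + e.name + ">");
        }
    }

    // The element's text with all tags ignored.
    static String textOf(El e) {
        StringBuilder sb = new StringBuilder();
        for (Object c : e.children) {
            if (c instanceof El) sb.append(textOf((El) c)); else sb.append(c);
        }
        return sb.toString();
    }

    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '"' || ch == '\\') sb.append('\\').append(ch);
            else if (ch < 0x20) sb.append("\\u").append(String.format("%04x", (int) ch));
            else sb.append(ch);
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        El list = new El("list").attr("json-type", "array")
            .add("stray text")
            .add(new El("item").attr("json-type", "string").add("a ").add(new El("b").add("bold")))
            .add(new El("n").attr("json-type", "number").add("42"))
            .add(new El("null"));
        System.out.println(toJson(list));
    }
}
```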

What about MicroXML attributes? We discard them for all elements except those with a json-type of object, where we treat them as additional key-value pairs (excepting of course any json-key attribute).

As usual, comments are solicited.
