2011-01-21

The MicroLark parser

I've been developing a parser for MicroXML which I have dubbed MicroLark, in honor of Tim Bray's original 1998 XML parser Lark. I didn't take any code from Lark, but we ended up converging on similar ideas: it provides both push and tree parsers (as well as a pull parser), it is written in Java, and I intend to evolve it as MicroXML evolves. However, because MicroXML is much smaller than XML, MicroLark is about a third the size of Lark.

If you want to cut to the chase now, you can get the jar file, the source code of version 0.8, the Javadoc, and the test cases (which are based on the W3C's XML Conformance Test Suites). When you run the jar file, pass it one argument, the name of the file you want to parse, and you'll get the parsed output in Pyx format, a simple line-oriented format similar to SGML ESIS format. If you specify @ as the file name, MicroLark will read the names of files from the standard input, and write the output for each one to a file with the same name but with .pyx added.
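For readers who haven't met Pyx before, here is a rough sketch of what the format looks like. It assumes the usual Pyx conventions: one event per line, with "(" for a start-tag, "A" for an attribute, "-" for character data, and ")" for an end-tag, and with newlines, tabs, and backslashes in text escaped. The class and method names are made up for illustration and are not taken from MicroLark, whose exact output may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MicroLark: formats parse events as Pyx lines.
class PyxSketch {
    // Escape backslash first so the escapes added afterwards are not doubled.
    static String escape(String text) {
        return text.replace("\\", "\\\\")
                   .replace("\n", "\\n")
                   .replace("\t", "\\t");
    }

    // "(" line for the element, followed by one "A" line per attribute.
    static String startTag(String name, Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("(").append(name).append('\n');
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append('A').append(e.getKey()).append(' ')
              .append(escape(e.getValue())).append('\n');
        }
        return sb.toString();
    }

    // "-" line for character data.
    static String text(String s) {
        return "-" + escape(s) + "\n";
    }

    // ")" line for the end-tag.
    static String endTag(String name) {
        return ")" + name + "\n";
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "x");
        // Events for an element p with one attribute and two lines of text.
        System.out.print(startTag("p", attrs) + text("hi\nthere") + endTag("p"));
    }
}
```

Each event lands on its own line, which is what makes the format easy to compare with diff or process with line-oriented tools.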

MicroLark 0.8 supports MicroXML according to James's most recent definition, with the addition of prefixed attribute names. MicroLark allows, but does not require, namespace definitions for these prefixes. Element names do not allow prefixes. This version provides the parser proper, the Element class, and a MicroXML writer. I'll be adding more MicroXML-specific test cases in a later release. Future versions will supply a package of iterators to allow XPath-style operations, and a validator based on MicroRNG.

The core of MicroLark is a pull parser implemented as a state machine, which is just a big switch of switches. The outer switch is controlled by the current state, and the inner switches by the current character. The parser returns when a start-tag, end-tag, character data, end of document, or error is found, making sure that parsing can be resumed smoothly afterwards. Consequently, it is not draconian, though its error recovery strategy mostly consists of resetting the current state to "character data". When the parser returns, the caller can call accessors to get the current element stack or character data, or the location and text of an error. The push parser (SAX-style) and the tree parser (DOM-style) are thin layers over the pull parser; the tree parser is draconian.
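To make the "switch of switches" idea concrete, here is a minimal sketch of a resumable pull parser for a drastically reduced language: start-tags and end-tags without attributes, plus character data. It is not MicroLark's code. The class, the states, the events, and the error messages are all assumptions invented for this illustration, and it does no well-formedness checking beyond rejecting empty tag names and a stray "<" inside a tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from MicroLark.
class TinyPullParser {
    enum Event { START_TAG, END_TAG, TEXT, END_DOCUMENT, ERROR }
    private enum State { CHAR_DATA, TAG_OPEN, START_NAME, END_NAME }

    private final String input;
    private int pos = 0;                       // survives between calls to next()
    private State state = State.CHAR_DATA;     // so does the current state
    private final StringBuilder buf = new StringBuilder();
    private String value = "";                 // name, text, or error message

    TinyPullParser(String input) { this.input = input; }

    String value() { return value; }

    // Runs the state machine until there is something to report, then returns.
    Event next() {
        while (pos < input.length()) {
            char c = input.charAt(pos++);
            switch (state) {                   // outer switch: current state
            case CHAR_DATA:
                switch (c) {                   // inner switch: current character
                case '<':
                    state = State.TAG_OPEN;
                    if (buf.length() > 0) return emit(Event.TEXT);
                    break;
                default:
                    buf.append(c);
                }
                break;
            case TAG_OPEN:
                switch (c) {
                case '/':
                    state = State.END_NAME;
                    break;
                case '<': case '>':
                    return error("empty tag name");
                default:
                    buf.append(c);
                    state = State.START_NAME;
                }
                break;
            case START_NAME:
                switch (c) {
                case '>':
                    state = State.CHAR_DATA;
                    return emit(Event.START_TAG);
                case '<':
                    return error("'<' inside tag");
                default:
                    buf.append(c);
                }
                break;
            case END_NAME:
                switch (c) {
                case '>':
                    state = State.CHAR_DATA;
                    return emit(Event.END_TAG);
                case '<':
                    return error("'<' inside tag");
                default:
                    buf.append(c);
                }
                break;
            }
        }
        if (state != State.CHAR_DATA) return error("unterminated tag");
        if (buf.length() > 0) return emit(Event.TEXT);
        return Event.END_DOCUMENT;
    }

    private Event emit(Event e) {
        value = buf.toString();
        buf.setLength(0);
        return e;
    }

    // Recovery: drop the partial token and fall back to character data.
    private Event error(String message) {
        value = message;
        buf.setLength(0);
        state = State.CHAR_DATA;
        return Event.ERROR;
    }

    // Convenience for experiments: collect every event up to END_DOCUMENT.
    static List<String> trace(String input) {
        TinyPullParser p = new TinyPullParser(input);
        List<String> out = new ArrayList<>();
        Event e;
        do {
            e = p.next();
            out.add(e == Event.END_DOCUMENT ? e.name() : e + ":" + p.value());
        } while (e != Event.END_DOCUMENT);
        return out;
    }
}
```

Because the position and the state live in fields rather than in local variables, every return leaves the machine ready to pick up at the next character. A push or tree layer can then be a plain loop around next().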

The Element class is the incarnation of the MicroXML data model, so it provides access to the name, attributes, and children of an element. Element objects are provided by all three parsers, though only the tree parser populates the children. There are the usual collection methods for fetching, searching, and mutating the attributes and children; text children are represented as Strings, and when an attempt is made to insert a text child next to an existing text child, the two are coalesced. You can create your own Element objects, and the class ensures that they always represent well-formed MicroXML (for example, the names of elements and attributes must be well-formed). There are convenience methods for retrieving the current value of an inherited attribute, and for obtaining the current values of xml:lang/lang, xml:id/id, xml:base, and the namespace of a prefixed attribute.
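The coalescing rule and the name check are easy to picture with a toy class. The sketch below is not the real Element class: SketchElement, addChild, and isName are hypothetical names, only appending is handled (not insertion at an arbitrary position), and the name test is a crude approximation rather than the MicroXML name grammar.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an element class; not MicroLark's Element.
class SketchElement {
    private final String name;
    private final List<Object> children = new ArrayList<>();

    SketchElement(String name) {
        if (!isName(name)) {
            throw new IllegalArgumentException("not a valid name: " + name);
        }
        this.name = name;
    }

    String name() { return name; }

    // Read-only view, so all mutation goes through addChild.
    List<Object> children() { return Collections.unmodifiableList(children); }

    // Simplified name test (an assumption, not the MicroXML grammar):
    // a letter or '_' first, then letters, digits, '_', '-', or '.'.
    static boolean isName(String s) {
        if (s == null || s.isEmpty()) return false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean start = Character.isLetter(c) || c == '_';
            boolean rest = start || Character.isDigit(c) || c == '-' || c == '.';
            if (i == 0 ? !start : !rest) return false;
        }
        return true;
    }

    // Appends a child; text appended after text is merged into one String.
    void addChild(Object child) {
        if (child instanceof String) {
            String text = (String) child;
            if (text.isEmpty()) return;        // nothing to add
            int last = children.size() - 1;
            if (last >= 0 && children.get(last) instanceof String) {
                children.set(last, (String) children.get(last) + text);
            } else {
                children.add(text);
            }
        } else if (child instanceof SketchElement) {
            children.add(child);
        } else {
            throw new IllegalArgumentException("children must be text or elements");
        }
    }
}
```

With this sketch, appending "Hello, " and then "world" to a fresh element leaves a single text child, so callers never see two adjacent Strings.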

Users can create subclasses of Element and instruct the parser to use them by creating a factory object. Factory objects get the current element stack from the parser as well as the name of the new element, and return an instance based on them. This allows tree nodes to have their own fields and methods suitable for their use in the application, as well as the creation of tree nodes that enforce restrictions on their children such as "no text children" or "element children must belong to class X".
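Here is one way the factory arrangement could look, sketched with invented types. TreeNode, TextFreeNode, TreeNodeFactory, and TableFactory are hypothetical names and the method signature is a guess at the general shape; consult the Javadoc for the real interface. The point is only that the factory sees both the element stack and the new name, and that a subclass can police its own children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class standing in for an element type.
class TreeNode {
    final String name;
    final List<Object> children = new ArrayList<>();

    TreeNode(String name) { this.name = name; }

    void addChild(Object child) { children.add(child); }
}

// A subclass that enforces "no text children".
class TextFreeNode extends TreeNode {
    TextFreeNode(String name) { super(name); }

    @Override
    void addChild(Object child) {
        if (child instanceof String) {
            throw new IllegalStateException(name + " may not contain text");
        }
        super.addChild(child);
    }
}

// Assumed factory shape: the open elements (innermost last) plus the new name.
interface TreeNodeFactory {
    TreeNode create(List<TreeNode> stack, String name);
}

// Example policy: a "row" directly inside a "table" may not hold text.
class TableFactory implements TreeNodeFactory {
    @Override
    public TreeNode create(List<TreeNode> stack, String name) {
        TreeNode parent = stack.isEmpty() ? null : stack.get(stack.size() - 1);
        if (name.equals("row") && parent != null && parent.name.equals("table")) {
            return new TextFreeNode(name);
        }
        return new TreeNode(name);
    }
}
```

Passing the stack as well as the name lets the same element name map to different classes depending on context, which a name-only factory could not do.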

MicroLark is open source, licensed under the Apache 2.0 license. Check it out, play with it, and report bugs and suggestions for improvement in the comments here or to cowan@ccil.org.
