2007-08-13

Extreme Markup 2007: Tuesday

This is a report on Extreme Markup 2007 for Tuesday.

Tuesday started, as is traditional, with a welcome from James David Mason, Steve Newcomb, and Michael Sperberg-McQueen, all luminaries of the XML community, followed by a brief practical introduction by Debbie LaPeyre (where are the bathrooms? where is the coffee? who can help you?) and the keynote speech by Tommie Usdin. Debbie and Tommie have been running this conference for years, and are now doing so directly through their company, Mulberry Technologies, rather than via IDEAlliance -- a parting of the ways with clear benefits for both sides, I believe.

Tommie's talk, "Riding the wave, riding for a fall, or just along for the ride" was about a lot of things, but what drew my attention was the question of edge cases. 95% of all XML, we are told, is not written by human beings but generated as part of Web Services calls. And then again, depending on who you ask, 95% of XML is RSS feeds. We can expect the list of 95%s to grow in future. Or perhaps 95% is really human-authored content after all. Who knows?

Next, still in plenary session, Tom Passin presented on the practical use of RDF for system modeling. The interesting feature here for me was an indented-plain-text representation of RDF that can be rewritten by a simple Python program as RDF/XML. It takes advantage of the fact that people naturally write lists, naturally indent the subpoints (which can be modeled as RDF properties), and can easily extend the indentation to add values. Here's an example from his paper:

Organization::
   rdf:about::OFS
   rdfs:label::Office for Federated Systems
   member-of::
      rdf:resource::Federated-Umbrella-Organization
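
To make the idea concrete, here is a minimal Python sketch of such a rewrite. It is not Tom's program: the rules it applies (every line is `name::value`, nesting follows indentation, and `rdf:about` / `rdf:resource` lines become attributes of their parent) are my assumptions from the example above, and it emits a fragment without namespace declarations or an `rdf:RDF` wrapper.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: these two names are treated as XML attributes, everything else as elements.
ATTRS = {"rdf:about", "rdf:resource"}

SAMPLE = """\
Organization::
   rdf:about::OFS
   rdfs:label::Office for Federated Systems
   member-of::
      rdf:resource::Federated-Umbrella-Organization
"""

def parse_indented(text):
    """Parse 'name::value' lines into nested dicts, using indentation for nesting."""
    root = {"name": None, "value": "", "children": []}
    stack = [(-1, root)]                      # (indent, node) pairs of open ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        name, _, value = raw.strip().partition("::")
        node = {"name": name, "value": value, "children": []}
        while stack[-1][0] >= indent:         # close siblings and deeper levels
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root["children"]

def to_xml(node, depth=0):
    """Serialize one parsed node as an RDF/XML-style fragment."""
    pad = "  " * depth
    attrs = [c for c in node["children"] if c["name"] in ATTRS]
    kids = [c for c in node["children"] if c["name"] not in ATTRS]
    attr_text = "".join(f" {a['name']}={quoteattr(a['value'])}" for a in attrs)
    head = f"{pad}<{node['name']}{attr_text}"
    if not kids and not node["value"]:
        return head + "/>"
    if not kids:
        return f"{head}>{escape(node['value'])}</{node['name']}>"
    # A node with child elements ignores its own value in this sketch.
    inner = "\n".join(to_xml(k, depth + 1) for k in kids)
    return f"{head}>\n{inner}\n{pad}</{node['name']}>"

print(to_xml(parse_indented(SAMPLE)[0]))
```

For the sample above this prints an `Organization` element carrying `rdf:about="OFS"`, with an `rdfs:label` child holding the text and an empty `member-of` child pointing at the resource.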

The final plenary session was Michael Kay, the author of the excellent XSLT processor Saxon, on optimizing XSLT using an XSLT transformation. XSLT is written as a tree, and it's also an excellent language for transforming trees, and optimization is mostly about transforming one tree into another. For example, an expression of the form count(X) = 0 in XQuery can be rewritten as empty(X); it's a lot faster if you don't have to count thousands of Xs just to determine whether there are any. (Saxon treats XQuery as just another syntax for XSLT 2.0.)

Michael's approach involves having lots of simple rewrite rules of this type, and then iterating over the list until no more changes are being made. That's important, because one optimization may expose the possibility of another optimization being made. Sometimes it pays to recognize a general XSLT expression as a special case, and provide a processor-specific extension that provides that special case efficiently. For this reason, the rewrites are as a whole processor-specific. Michael talked in particular about the case of having lots of boolean-valued attributes on an element whose content model is known. Rewriting can cleverly pack those attributes into a single integer-valued attribute with special functions to test bit combinations.
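
The iterate-until-stable idea can be sketched in a few lines of Python. This is only an illustration, not Saxon's code: Michael's rules are written in XSLT, while here expressions are nested tuples and the two rules (`count(X) = 0` to `empty(X)`, and `not(empty(X))` to `exists(X)`) are ones I picked to show one rewrite exposing another.

```python
# Expressions are nested tuples: ("=", ("count", "X"), 0) stands for count(X) = 0.

def rule_count_eq_zero(e):
    """count(X) = 0  ->  empty(X)"""
    if (isinstance(e, tuple) and len(e) == 3 and e[0] == "="
            and isinstance(e[1], tuple) and e[1][0] == "count" and e[2] == 0):
        return ("empty", e[1][1])
    return e

def rule_not_empty(e):
    """not(empty(X))  ->  exists(X)"""
    if (isinstance(e, tuple) and len(e) == 2 and e[0] == "not"
            and isinstance(e[1], tuple) and e[1][0] == "empty"):
        return ("exists", e[1][1])
    return e

RULES = [rule_count_eq_zero, rule_not_empty]

def rewrite_once(e, rules):
    """One top-down pass: try every rule at this node, then recurse into the operands."""
    for rule in rules:
        e = rule(e)
    if isinstance(e, tuple):
        e = (e[0],) + tuple(rewrite_once(c, rules) for c in e[1:])
    return e

def optimize(e, rules):
    """Repeat passes until a pass makes no change (a fixed point)."""
    while True:
        new = rewrite_once(e, rules)
        if new == e:
            return e
        e = new

# not(count(X) = 0): the first pass yields not(empty(X)), the second exists(X).
print(optimize(("not", ("=", ("count", "X"), 0)), RULES))   # ('exists', 'X')
```

Because each pass visits a node before its operands, the second rule cannot fire until the first has already rewritten the inner expression, which is why a single pass is not enough.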

Update: Michael also pointed out that XSLT optimizers don't know anything about the data, unlike the built-in SQL query optimizers that databases use, and so have to make guesses about how much of what you're going to find in any input document. Someone asked if he had made any explorations in that direction: he said not as yet.

In the question period, Michael said that Saxon is moving more and more to generating JVM/CLR bytecodes rather than interpreting the decorated tree at runtime. Some cases are still difficult to make fast in generated code: for example, given the path $x/a/b/c/d/e, if $x is known to be a singleton, the results will already be in document order, but if not, a sort must be performed. This could obviously be compiled into conditional code, but it's messy and can lead to a combinatorial explosion as it interacts with other features. Deciding at run time is just easier.
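
A toy Python sketch of that run-time decision, under heavy assumptions: the `Node` class, its `pos` field (position in document order), and `eval_path` are all hypothetical, and only child steps are handled. It shows the shape of the check, not anything Saxon actually does.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    pos: int                                   # position in document order (hypothetical)
    children: list = field(default_factory=list)

def child_step(nodes, name):
    """Select the children called `name` of each node, in the order visited."""
    return [c for n in nodes for c in n.children if c.name == name]

def eval_path(start, names):
    """Evaluate start/name1/name2/... and return the result in document order."""
    result = list(start)
    for name in names:
        result = child_step(result, name)
    if len(start) <= 1:
        # Singleton (or empty) start: child steps already yield document order.
        return result
    # General case: remove duplicates and sort by document position.
    by_pos = {n.pos: n for n in result}
    return [by_pos[p] for p in sorted(by_pos)]

b1 = Node("b", 2)
a1 = Node("a", 1, [b1])
b2 = Node("b", 4)
a2 = Node("a", 3, [b2])

print([n.pos for n in eval_path([a1], ["b"])])        # [2]
print([n.pos for n in eval_path([a2, a1], ["b"])])    # [2, 4]
```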

Most of the remaining papers were given two at a time, although recognizable tracks don't really exist at Extreme. I went to see David Lee of Epocrates on getting content authored in MS Word into appropriate XML. The core of this talk was an extended lament on how authors insist on using Word; even if you provide specialized authoring tools, they compose in Word and then cut and paste, more or less incorrectly, into the specialized tool. Epocrates has tried a variety of strategies: Word styles (authors won't use them), tagged sections (authors screw them up), form fields (plaintext only, so authors delete them and type in rich text instead). In the end, they adopted Word tables as the safest and least corruptible approach. A few Word macros provide useful validations, and when the document is complete, a Word 2003 macro rewrites it using Word 2003 XML (unless it is already in that format). I pointed out that the approach of having authors use Word and saving in plain text was also viable, leaving all markup to be added by automated downstream processing; David said that design was too simple for the complex documents his authors were creating.

Next I heard Michael Sperberg-McQueen on his particular flavor of overlap; I have nothing in particular to add to his paper, which is extremely accessible.

As a RELAX NG geek, I was interested to hear about Relaxed, a completely new implementation of RELAX NG and Schematron validation. Update: I forgot to mention that it also provides NVDL (which accounts for the name), a meta-schema language which dissects a document by namespace, local root element, or both, and passes different fragments of the document to different validators using different schemas. JNVDL uses the Sun Multi-Schema Validator to translate any other schema language you can think of into RELAX NG. You can try the validator online.
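
The dispatch-by-namespace idea can be roughed out in Python as follows. This is a simplification of what NVDL specifies and has nothing to do with how JNVDL is implemented: the sketch only finds the elements where the namespace changes and hands each one to a stand-in validator, without detaching nested foreign fragments. The `validators` mapping and the lambda validators are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"

DOC = (
    f'<h:html xmlns:h="{XHTML}" xmlns:s="{SVG}">'
    "<h:body><s:svg><s:circle/></s:svg><h:p/></h:body>"
    "</h:html>"
)

def ns_of(elem):
    """Namespace URI of an ElementTree element ('' if it has none)."""
    return elem.tag[1:].split("}", 1)[0] if elem.tag.startswith("{") else ""

def sections(elem, parent_ns=None):
    """Yield (namespace, element) wherever the namespace differs from the parent's."""
    ns = ns_of(elem)
    if ns != parent_ns:
        yield ns, elem
    for child in elem:
        yield from sections(child, ns)

def dispatch(root, validators):
    """Send each section root to the validator registered for its namespace."""
    report = []
    for ns, elem in sections(root):
        validate = validators.get(ns)
        report.append((ns, validate(elem) if validate else "no validator"))
    return report

# Stand-in validators; a real setup would invoke RELAX NG, Schematron, and so on.
validators = {XHTML: lambda e: "checked as XHTML", SVG: lambda e: "checked as SVG"}
for ns, verdict in dispatch(ET.fromstring(DOC), validators):
    print(ns, "->", verdict)
```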

I should mention that this conference deliberately provides long breaks, a long lunch, and encourages in-the-halls discussion, one of the most important parts (some say the only important part) of any conference. So if I didn't attend the maximum possible number of talks, there were reasons for that. (Sometimes you just need to catch up on your email, too.)

5 comments:

Ed Davies said...

indented-plain-text representation of RDF

Apart from syntactic details (about which wars may be fought) what distinguishes this from the likes of N3 or Turtle?

John Cowan said...

N3 is a flat representation: each triple is independent of all the rest syntactically. Tom's indented representation has the same striping property as XML/RDF; the object of an RDF claim (indented twice) can have its own predicates and objects attached.

Ed Davies said...

You're thinking of N-Triples. N3 and Turtle are much richer (though supersets of N-Triples). Here's an example from the N3 Primer:

<#pat> <#child> <#al>, <#chaz>, <#mo> ;
<#age> 24 ;
<#eyecolor> "blue" .

This looks better in a wider display area with proper use of the indentation.

N3 and Turtle have a lot in common. They both use commas to separate lists of objects with the same subject and predicate and semicolons to separate predicate/object list pairs with the same subject.

John Cowan said...

Well, your grasp of N3 is far greater than mine, but this just looks like eliding the subject when there are several triples with a shared subject. The use case for indentation (as for XML/RDF striping) is that you want to add secondary triples that say things about the object of a primary triple: John flew-to New-York [which is] located-in the-U.S.

ed davies said...

Ah yes, you're right. Turtle and, I think, N3 only allow multiple layers like that when the intermediate objects are blank nodes. E.g.,

x:John x:flew-to [ a x:City; x:name "New York"; x:in [ a x:Country; x:name "USA" ] ]

introducing two blank nodes: one for the city and one for the country.

Pity they seem to have used up most of the types of brackets which could sensibly be used to add a notation to express more layers. (N3 uses braces which Turtle doesn't).