2008-10-14

Converting Restricted XML to Good-Quality JSON

Here are some ideas for converting restricted forms of XML to good-quality JSON. The restrictions are as follows:
  • The XML can't contain mixed content (elements with both children/attributes and text).
  • The XML cannot depend on the order of child elements with distinct names (order dependence in children with the same name is okay).
  • There can't be any attributes with the same name as child elements.
  • There can't be any elements or attributes that differ only in their namespace names.
You also need to know the following things for each child element:
  • Whether it MUST appear at most once (a singleton element) or MAY appear more than once (a multiplex element).
  • Whether it only contains text (an element with simple type) or child elements and/or attributes (an element with complex type).
Now, to convert the XML to JSON, apply these rules recursively:
  • A singleton element of simple type, and likewise an attribute, is converted to a JSON simple value: a number or boolean if syntactically possible, otherwise a string.
  • A multiplex element of simple type is converted to a JSON array of simple values.
  • A singleton element of complex type is converted to a JSON object that maps the local names of child elements and attributes to their content. Namespace names are discarded.
  • A multiplex element of complex type is mapped to a JSON array of JSON objects that map the local names of child elements and attributes to their content. Namespace names are discarded.
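
Here's a minimal sketch of these rules in Java -- my illustration, not a finished library. It assumes a namespace-aware DOM, and it assumes the caller supplies the set of local names of the multiplex elements, which is exactly the extra knowledge the rules demand. The resulting tree of Maps, Lists, strings, numbers, and booleans can be handed to any JSON serializer.

    import java.util.*;
    import org.w3c.dom.*;

    public class RestrictedXmlToJson {
        private final Set<String> multiplex;  // local names of multiplex elements

        public RestrictedXmlToJson(Set<String> multiplexNames) {
            this.multiplex = multiplexNames;
        }

        // An element of complex type becomes a JSON object mapping the local
        // names of its attributes and child elements to their content.
        public Map<String, Object> convertComplex(Element e) {
            Map<String, Object> obj = new LinkedHashMap<>();
            NamedNodeMap atts = e.getAttributes();
            for (int i = 0; i < atts.getLength(); i++) {
                Attr a = (Attr) atts.item(i);
                obj.put(localName(a), simpleValue(a.getValue()));  // namespace discarded
            }
            for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) continue;
                Element c = (Element) n;
                Object value = isSimpleType(c) ? simpleValue(c.getTextContent())
                                               : convertComplex(c);
                String name = localName(c);
                if (multiplex.contains(name)) {  // multiplex: accumulate an array
                    @SuppressWarnings("unchecked")
                    List<Object> list =
                        (List<Object>) obj.computeIfAbsent(name, k -> new ArrayList<>());
                    list.add(value);
                } else {                         // singleton: a plain member
                    obj.put(name, value);
                }
            }
            return obj;
        }

        // Simple type: no attributes and no child elements, only text.
        private static boolean isSimpleType(Element e) {
            if (e.getAttributes().getLength() > 0) return false;
            for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling())
                if (n.getNodeType() == Node.ELEMENT_NODE) return false;
            return true;
        }

        // A simple value: a number or boolean if syntactically possible,
        // otherwise a string.
        private static Object simpleValue(String text) {
            if (text.equals("true") || text.equals("false")) return Boolean.valueOf(text);
            try { return Double.valueOf(text); }
            catch (NumberFormatException ex) { return text; }
        }

        private static String localName(Node n) {
            return n.getLocalName() != null ? n.getLocalName() : n.getNodeName();
        }
    }
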
Comments are very welcome.

2008-09-17

I before E except after C

Here's a better version of the little poem. I don't know who wrote it; I touched it up a bit for better rhythm:

When IE and EI both say EE,
Who can tell which it should be?
After C, use E then I;
Otherwise IE will apply.
Some exceptions we may note
Which one needs to learn by rote:
Protein, caffeine, weird, and seize,
And in the U.S., leisure, please.

2008-06-18

Dorian Sion Cowan

My grandson Dorian was born at 9:08 PM yesterday, June 17, 2008 (New York time). He weighed 9 lb 0.9 oz (4110 g) at birth, and was 22 inches (56 cm) long. And he is the Best Baby In The World.

(Well, when I say that, I make a mental reservation in favor of Irene, Dorian's mommy, who is now almost 21 but was certainly the Best Baby in her day.)

Baby and mother are doing wonderfully well -- Dorian is starting to breastfeed very nicely, and already knows a great many Proto-Indo-European roots. Irene's Caesarean incision is still very sore, and the IV is in her hand, not her arm, which makes handling him a little awkward for her. Her best friends have been hovering around the two of them, and so have Gale and I as far as we have been able. They will be coming home Friday morning.

Anyhow, I sang him a lullaby the night he was born, not that he needed it -- he was pretty well drifting off anyhow. But even though my voice was cracking, I needed to sing it to him. It's by Fred Small, and is called "Everything Possible". This is the slightly altered version of the chorus that Dorian actually got:

You can be anybody you want to be,
You can love whomever you will.
You can travel any country where your heart leads,
And know I will love you still.
You can live by yourself, you can gather friends around,
Or find one special one,
And the only measure of your words and your deeds
Is the love you leave behind you when you're gone.

And this is the second song he heard from me, this morning when I stopped by to see him:

Rockabye Dorian, on the tree-top
When you are fed, your poop will go plop
When you have plopped, your diaper we'll change
And then you'll be cleaned up and happy again.

Okay, it doesn't quite rhyme, but it's his.

Dorian, if you are reading this, you already know your grandfather is a crazy old man who embarrasses the hell out of people. You'll live this one down too.

2008-05-04

Essentialist Explanations, 14th edition

As always, posted here. We are getting close to 1000 entries -- keep them coming in!

2008-04-25

Eulogy

The following was said of David Ricardo by Maria Edgeworth:

I never argued or discussed a question with any person who argues more fairly, or less for victory and more for truth. He gives full weight to every argument brought against him, and seems not to be on any side of the question for one instant longer than the conviction of his mind is on that side. It seems quite indifferent to him whether you find the truth or whether he finds it, provided it be found.

Or more concisely: He wanted to be right, whether or not he had been right.

Ricardo died at fifty-one. I myself am almost fifty, and if I were to die next year, I hope as much could truthfully be said of me.

2008-03-19

On the word "bumblebee"

The story of the word bumblebee is curious, but (contra Mr. Burns of the Simpsons) certainly doesn't lead back to a form like bumbled bee, in the way that ice cream leads back to iced cream, or the American form skim milk descends from the form skimmed milk still current elsewhere. The bee part is transparent, and there is a Middle English verb bomb(e)len, meaning to make a humming sound, presumably of imitative origin. So there you are.

However, it's clear that the older form was humble-bee, where hum(b)le is an intensive of hum, which is also presumably of imitative origin. Whether bumblebee is a new coinage based on bombelen, or whether it is an alteration of humble-bee by dissimilation, or a mixture of both, it's impossible to say.

But when we look in Pokorny's etymological dictionary of Indo-European for hum, we see it under the root kem²-, as expected by Grimm's Law, and with Lithuanian reflexes in k- and Slavic ones in ch- that also refer to humming noises and bees. That certainly does not sound imitative to me -- the sharp sound of [k] is nothing like a bee hum, which has no beginning and no end. So in the end the obvious imitative nature of bumblebee leads to a riddle wrapped in a mystery inside an enigma.

And there remains at least one dangling oddity: Pokorny also lists an Old Indic -- "Ai." abbreviates German altindisch -- reflex meaning "yak". Yaks grunt (as the Linnaean name Bos grunniens indicates), they don't hum, and what is Old Indic doing with an inherited word for "yak" anyhow? English, like most modern languages, has borrowed its word from Tibetan.

2008-03-03

Elements or attributes?

Here's my contribution to the "elements vs. attributes" debate:

General points:

  1. Attributes are more restrictive than elements, and all designs have some elements, so an all-element design is simplest -- which is not the same as best.

  2. In a tree-style data model, elements are typically represented internally as nodes, which use more memory than the strings used to represent attributes. Sometimes the nodes are instances of different application-specific classes, and in many languages each such class costs still more memory to represent.

  3. When streaming, elements are processed one at a time (possibly even piece by piece, depending on the XML parser you are using), whereas all the attributes of an element and their values are reported at once, which costs memory, particularly if some attribute values are very long.

  4. Both element content and attribute values need to be escaped, so escaping should not be a consideration in the design.

  5. In some programming languages and libraries, processing elements is easier; in others, processing attributes is easier. Beware of using ease of processing as a criterion. In particular, XSLT can handle either with equal facility.

  6. If a piece of data should usually be shown to the user, use an element; if not, use an attribute. (This rule is often violated for one reason or another.)

  7. If you are extending an existing schema, do things by analogy to how things are done in that schema.

  8. Sensible schema languages, meaning RELAX NG, treat elements and attributes symmetrically. Older and cruder schema languages tend to have better support for elements.

Using elements:

  1. If something might appear more than once in a data model, use an element rather than introducing attributes with names like part1, part2, part3 ....

  2. If order matters between two pieces of data, use elements for them: attributes are inherently unordered.

  3. If a piece of data has, or might have, its own substructure, put it in an element: getting substructure into an attribute is always messy. Similarly, if the data is a constituent part of some larger piece of data, put it in an element.

  4. An exception to the previous rule: multiple whitespace-separated tokens can safely be put in an attribute. In principle, the separator can be anything, but schema-language validators are currently only able to handle whitespace, so it's best to stick with that.

  5. If a piece of data extends across multiple lines, use an element: XML parsers will change newlines in attribute values into spaces.

  6. If a piece of data is in a natural language, put it in an element so you can use the xml:lang attribute to label the language being used. Some kinds of natural-language text, like Japanese, also require annotations that are conventionally represented using child elements; right-to-left languages like Hebrew and Arabic may similarly require child elements to manage bidirectionality properly.

Using attributes:

  1. If the data is a code from an enumeration, code list, or controlled vocabulary, put it in an attribute if possible. For example, language tags, currency codes, medical diagnostic codes, etc. are best handled as attributes.

  2. If a piece of data is really metadata on some other piece of data (for example, representing a class or role that the main data serves, or specifying a method of processing it), put it in an attribute if possible.

  3. In particular, if a piece of data is an ID (either a label or a reference to a label elsewhere in the document) for some other piece of data, put the identifying piece in an attribute. When it's a label, use the name xml:id for the attribute.

  4. Hypertext references (hrefs) are conventionally put in attributes.

  5. If a piece of data is applicable to an element and any descendant elements unless it is overridden in some of them, it is conventional to put it in an attribute. Well-known examples are xml:lang, xml:space, xml:base, and namespace declarations.

  6. If terseness is really the most important thing, use attributes, but consider gzip compression instead -- it works very well on documents with highly repetitive structures.
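
Here's a tiny sketch of several of these rules working together (the vocabulary is invented): user-visible, multi-line, and structured data live in elements, while codes, IDs, references, and hrefs live in attributes.

    <book xml:id="b42" xml:lang="en">
      <title>The Annotated Alice</title>
      <price currency="USD">24.95</price>
      <related href="http://example.com/carroll"/>
      <blurb>
        Natural-language text, which may run
        across several lines.
      </blurb>
    </book>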

Michael Kay says:

Beginners always ask this question.
Those with a little experience express their opinions passionately.
Experts tell you there is no right answer.

I say:

Newbies always ask:
     "Elements or attributes?
Which will serve me best?"
     Those who know roar like lions;
     Wise hackers smile like tigers.
          --a tanka, or extended haiku

Final words:

Break any or all of these rules rather than create a crude, arbitrary, disgusting mess of a design if that's what following them slavishly would give you. In particular, random mixtures of attributes and child elements are hard to follow and hard to use, though it often makes good sense to use both when the data clearly fall into two different groups such as simple/complex or metadata/data.


2008-02-07

Which characters are excluded in XML 5th Edition names?

The list of allowed name characters in the XML 1.0 Fifth Edition looks pretty miscellaneous. The clue to what's really going on is that unlike the rule of earlier XML 1.0 versions, where everything not permitted was forbidden, now everything that is not forbidden is permitted. (I emphasize that this is only about name characters: every character is and always has been permitted in running text and attribute values except the ASCII controls.)

So what's forbidden, and why?

  • The ASCII control characters and their 8-bit counterparts. Obviously.
  • The ASCII and Latin-1 symbolic characters, with the exceptions of hyphen, period, colon, underscore, and middle dot, which have always been permitted in XML names. These characters are commonly used as syntax delimiters either in XML itself or in other languages, and so are excluded.
  • The Greek question mark, which looks like a semicolon and is canonically equivalent to a regular semicolon.
  • The General Punctuation block of Unicode, with the exceptions of the zero-width joiner, zero-width non-joiner, undertie, and character-tie characters, which are required in certain languages to spell words correctly. Various kinds of blank spaces and assorted punctuation don't make sense in names.
  • The various Unicode symbols blocks reserved for "pattern syntax", from U+2190 to U+2BFF. These characters should never appear in identifiers of any sort, as they are reserved for use as syntactic delimiters in future languages that exploit non-ASCII syntax. Many are assigned, some are not.
  • The Ideographic Description Characters block, which is used to describe (not create) uncoded Chinese characters.
  • The surrogate code units (which don't correspond to Unicode characters anyhow) and private-use characters. Using the latter, in names or otherwise, is very bad for interoperability.
  • The Plane 0 non-characters at U+FDD0 to U+FDEF, U+FFFE, and U+FFFF. The non-characters on the other planes are allowed, not because they are a good idea, but to simplify implementation.

Note that the undertie and character tie, the European digits 0-9, and the diacritics in the Combining Characters block are not permitted at the start of a name. Other characters could have sensibly been excluded, particularly combining characters that don't happen to be in the Combining Characters block, but it simplifies implementation to permit them.

This list is intentionally sparse. The new Appendix J gives a simplified set of non-binding suggestions for choosing names that are actually sensible.
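
In code, the whole rule now fits in a handful of range checks. Here's a sketch in Java, with the ranges transcribed from the NameStartChar and NameChar productions of the Fifth Edition draft (verify them against the final text before relying on them):

    // c is a Unicode code point, not a UTF-16 code unit.
    static boolean isNameStartChar(int c) {
        return c == ':' || c == '_'
            || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6)
            || (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D)
            || (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D)
            || (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF)
            || (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF)
            || (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
    }

    // The digits, middle dot, combining diacritics, undertie, and character
    // tie are legal only after the first character, as noted above.
    static boolean isNameChar(int c) {
        return isNameStartChar(c)
            || c == '-' || c == '.' || (c >= '0' && c <= '9') || c == 0xB7
            || (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
    }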

2008-02-06

Who do I work for?

Well, a company that provides an email service with about 10^7 users, and a calendar service with about 10^6 users, and a news syndicate with about 10^4 sources, and a video sharing facility that displays about 10^8 video views a day, and an image index with about 10^9 images. And it connects about 10^5 advertisers with about 10^5 online publishers and 10^3 offline ones, and provides online wallets for about 10^6 buyers and 10^5 sellers, and is localized in about 10^2 interface languages, and employs about 10^4 people, and is rated 10^0 in the list of best companies to work for. And it is not best known for any of these things.

Who are they?

10^100.

Justice at last, part two

The Fifth Edition of XML 1.0 is now a Proposed Edited Recommendation.

So what, you say. Ho hum, you say. A bunch of errata folded in to a new edition, you say. No real change here, you say.

But no, not at all, but quite otherwise. There's a big change here, assuming this PER gets past the W3C membership vote and becomes a full W3C Recommendation. There's something happening here, and what it is is eminently clear.

Justice is coming at last to XML 1.0.

For a long time, the characters used in the markup of an XML document -- element names, attribute names, processing instruction targets, and so on -- have been limited to those that were allowed in Unicode 2.0, which was issued in July 1996. If you wanted your element names in English, or French, or Arabic, or Hindi, or Mandarin Chinese, all was good. But if you wanted them in the national languages of Sri Lanka, or Eritrea, or Cambodia, or in Cantonese Chinese, to say nothing of lots and lots of minority languages, you were simply out of luck -- forever.

Not fair, people.

I tried fixing this the right way, by pushing the XML Core WG of the W3C to issue XML 1.1. It acquired some additional cruft along the way, some good, some in hindsight bad. It was roundly booed and even more roundly ignored. In particular, at least one 800-pound gorilla voted against it at W3C and refused to implement it.

Now it's being done the wrong way. We are simply extending the set of legal name characters to almost every Unicode character, relying on document authors and schema authors not to be idiots about it. Is that an incompatible change to XML 1.0 well-formedness? Hell yes. Is any existing XML 1.0 document going to become not well-formed? Hell no. We learned our lesson on that one.

Who supports this? I won't name names, but XML parser authors and distributors from gorillas to gibbons have been consulted in advance this time, and there are no screaming objections. Some will probably provide an option to turn Fifth Edition support on, others will turn it on by default. Unlike XML 1.1 support, this is actually a simplification: the big table of legal characters in Appendix B just isn't needed any more.

"Hot diggity (or however you say that in Amharic). When can I start using this?" Not so fast. First the W3C has to vote it in -- if they don't, all bets are off. Then implementations have to spread through the XML ecosystem, including not only development but deployment. It'll take years. But it only has to be done once, for all the writing systems that aren't in Unicode yet will all Just Work when they do get implemented.

Ask not what you can do for XML, but what XML can do for you.

It's morning in the world.

(Oh yes: Send comments before 16 May 2008 to xml-editor@w3.org.)

Justice at last

There was an old man from Nantucket
Who kept all his cash in a bucket.
   But his daughter Nan
   Ran away with a man
And as for the bucket, Nantucket.

The pair of them went to Manhasset,
The man and his Nan with the asset.
   Pa followed them there
   But they left in a tear
And as for the asset, Manhasset.

He followed them next to Pawtucket,
Nan and her man and the bucket.
   Pa said to the man,
   "You can keep my sweet Nan",
But as for the bucket, Pawtucket.

(This works best if you pronounce "Pa" as "paw", assuming you make any difference between the two -- in New England, there definitely is. If your "aw" sounds like "ah", you can hear the "aw" sound the rest of us use by saying "Awwwwwwwwww!")

Here's the trio's route:
   note the doubling back
      to avoid pursuit.


2008-01-31

Taggle, a TagSoup in C++, available now

A company called JezUK has released Taggle, which is a straight port of TagSoup 1.2 to C++. It's a part of Arabica, a C++ XML toolkit providing SAX, DOM, XPath, and partial XSLT. I have no connection with JezUK (except apparently as source of inspiration).

The author says the code is alpha-quality now, so he'd appreciate lots of testers to shake out bugs. C++ users, go to it! Having a C++ port will be a real enhancement for TagSoup.

The code is currently in public Subversion: you can fetch it with svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port.

2008-01-10

Revised home page

I've rewritten my home page at http://www.ccil.org/~cowan. Some interesting old things that were on my site but had no pointers from there now have little writeups, and I've reorganized it a bit -- but it's still the ultimate minimalist home page, no pictures or graphics.

2008-01-05

TagSoup 1.2 released at long last

There are a great many changes, most of them fixes for long-standing bugs, in this release. Only the most important are listed here; for the rest, see the CHANGES file in the source distribution. Very special thanks to Jojo Dijamco, whose intensive efforts at debugging made this release a usable upgrade rather than a useless mass of undetected bugs.

  • As noted above, I have changed the license to Apache 2.0.

  • The default content model for bogons (unknown elements) is now ANY rather than EMPTY. This is a breaking change, which I have done only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with parser.setFeature(Parser.emptyBogonsFeature, true).

  • The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like foo?cdown=32&cup=42 are no longer seen as containing an instance of the ∪ character (whose name happens to be cup).

  • Several new switches have been added:

    • --doctype-system and --doctype-public force a DOCTYPE declaration to be output and allow setting the system and public identifiers.

    • --standalone and --version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use --version=1.1.)

    • --norootbogons causes unknown elements not to be allowed as the document root element. Instead, they are made children of the default root element (the html element for HTML).

  • The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the 2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.

  • The SAX events startPrefixMapping and endPrefixMapping are now being reported for all cases of foreign elements and attributes.
  • All bugs around newline processing on Windows should now be gone.

  • A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.

  • Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element. This very long-standing bug has now been fixed.

  • The following non-standard elements are now at least partly supported: bgsound, blink, canvas, comment, listing, marquee, nobr, rbc, rb, rp, rtc, rt, ruby, wbr, xmp.

  • In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as checked="checked".

  • Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.
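
For context, here's a minimal sketch of driving TagSoup 1.2 from Java; the class name SoupDemo and the handler are mine, but the Parser class and the feature constant are the ones described above.

    import java.io.FileInputStream;
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.*;
    import org.xml.sax.helpers.DefaultHandler;

    public class SoupDemo {
        public static void main(String[] args) throws Exception {
            XMLReader parser = new Parser();  // TagSoup is an ordinary SAX XMLReader
            // Restore the pre-1.2 EMPTY content model for bogons, as described above:
            parser.setFeature(Parser.emptyBogonsFeature, true);
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    System.out.println("start of " + local);
                }
            });
            parser.parse(new InputSource(new FileInputStream(args[0])));
        }
    }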

Download the TagSoup 1.2 jar file here. It's about 87K long.
Download the full TagSoup 1.2 source here. If you don't have zip, you can use jar to unpack it.
Download the current CHANGES file here.

2007-08-19

Down vs. across

This turn-of-the-eighteenth-century poem reads one way down, another way across. The "down" version was politically orthodox back in the reign of George I, whereas the "across" version represented treasonous Jacobite sympathies.

I love with all my heart           The Tory party here
The Hanoverian part                Most hateful doth appear
And for their settlement           I ever have denied
My conscience gives consent        To be on James's side
Most glorious is the cause         To be with such a king
To fight for George's laws         Will Britain's ruin bring
This is my mind and heart          In this opinion I
Though none should take my part    Resolve to live and die

2007-08-17

Third normal form for classes

It's been wisely and wittily said, though I don't know who by, that a relation is in third normal form (3NF) when all its fields depend on "the key, the whole key, and nothing but the key". This is generally considered to be a Good Thing, though people do deviate from it for the sake of performance (or what they think performance will be -- but that's a whole different rant).

I'd like to introduce an analogous notion of 3NF for classes in object-oriented programming. A class is in 3NF when its public methods depend on the state, the whole state, and nothing but the state. By state here I mean all the private instance variables of the class, without regard to whether they are mutable or not. A public method depends on the state if it either directly refers to an instance variable, or else invokes a private method that depends on the state.

So what do I mean, "the state, the whole state, and nothing but the state"? Three things:

  • If a method doesn't depend on the state at all, it shouldn't be a method of the class. It should be placed in a utility class, or (in C++) outside the class altogether, or at the very least marked as a utility method. It's really a customer of the class, and leaving it out of the class improves encapsulation. (Scott Meyers makes this point in Item 23 of Effective C++, Third Edition.)

  • Furthermore, if the state can be partitioned into two non-overlapping sub-states such that no methods depend on both of them, then the class should be refactored into two classes with separate states. This also improves encapsulation, as the methods in one class can now be changed without regard to the internals of the other class.

  • Finally, if the behavior of a method depends on something outside the state, encapsulation is once again broken -- from the other direction this time. Such a method is difficult to test, since you cannot know what parts of what classes it depends on except by close examination.
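
Here's a contrived Java sketch of all three rules (the names are hypothetical). Before: one class whose state partitions into two disjoint pieces, plus a method that depends on no state at all.

    class Account {
        private long balanceCents;   // sub-state A
        private String ownerName;    // sub-state B

        void deposit(long cents) { balanceCents += cents; }    // touches only A
        void rename(String name) { ownerName = name; }         // touches only B
        static String dollars(long c) {                        // touches no state:
            return String.format("%d.%02d", c / 100, c % 100); // really a utility
        }
    }

    // After: each class's methods depend on the state, the whole state, and
    // nothing but the state, and the stateless helper becomes a customer.
    class Balance { private long cents; void deposit(long c) { cents += c; } }
    class Owner { private String name; void rename(String n) { name = n; } }
    final class MoneyUtil {
        static String dollars(long c) { return String.format("%d.%02d", c / 100, c % 100); }
    }
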

At any rate, this is my current understanding. My Celebes Kalossi model grew out of considering how methods and state belong together, and this is the practical fruit of it.

Update: I didn't talk about protected methods, protected instance variables, or subclassing. The subclasses of a class are different from its customers, and need to be considered separately if any protected methods or state exist. I am a firm believer in "design for subclassing or forbid it": if you follow the rules above, then instead of subclassing a class, you can simply replace it with a work-alike that has different state while taking no risks of breaking it. (You probably need to make the original class implement some interface.)

Furthermore, the static methods of a class have static state, and the same analysis needs to be performed with respect to them.

Comments?

2007-08-14

Extreme Markup 2007: Friday

This is a report on Extreme Markup 2007 for Friday.

Friday was a half-day, but Extreme 2007 saved the best of all its many excellent papers for almost the last. Why is it that we repeat the "content over presentation" mantra so incessantly, but throw up our hands when it comes to tables? All the standard table models -- HTML, TEI, CALS -- are entirely presentational: a table is a sequence of rows, each of which is a sequence of cells. There are ways to give each column a label, but row labels are just the leftmost cell, perhaps identified presentationally but certainly not in a semantic way. If you have a table of sales by region in various years, pulling out North American sales for 2005 with XPath is no trivial matter. Why must this be? Inquiring minds (notably David Birnbaum's) want to know.

David's proposal, really more of a meta-proposal, is to use semantic markup throughout. Mark up the table using an element expressing the sort of data, perhaps salesdata. Provide child elements naming the regions and years, and grandchildren showing the legal values of each. Then each cell in the table is simply a sales element whose value is the sales data, and whose attributes signify the region and year that it is data for. That's easy to query, and not particularly hard to XSLT into HTML either. (First use of the verb to XSLT? Probably not.)
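
Concretely, the markup might look something like this -- my invented sketch of the meta-proposal, not David's actual example:

    <salesdata>
      <regions><region>North America</region><region>Europe</region></regions>
      <years><year>2004</year><year>2005</year></years>
      <sales region="North America" year="2004">1800</sales>
      <sales region="North America" year="2005">2100</sales>
      <sales region="Europe" year="2004">950</sales>
      <sales region="Europe" year="2005">1200</sales>
    </salesdata>

North American sales for 2005 is now just salesdata/sales[@region='North America'][@year='2005'].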

Of course, there is no reason to stop at two categories. You can have a cube, or a hypercube, or as many dimensions as you want. The OLAP community knows all about the n-cubes you get from data mining, and the Lotus Improv tradition, now proudly carried on at quantrix.com (insert plug from highly satisfied customer here), has always been about presenting such n-cubes cleverly as spreadsheets.

The conference proper ended in plenary session with work by Henry Thompson of W3C that extends my own TagSoup project, which provides a stream of SAX events based on nasty, ugly HTML, to have a different back end. TagSoup is divided into a lexical scanner for HTML, which uses a state machine driven by a private markup language, and a rectifier, which is responsible for outputting the right SAX events in the right order. It's a constraint on TagSoup that although it can retain tags, it always pushes character content through right away, so it is streaming (modulo the possibility of a very large start-tag).

Henry's PyXup replaces the TagSoup rectifier, which uses another small markup language specifying the characteristics of elements, with his own rectifier using a pattern-action language. In TagSoup, you can say what kind of element an element is and something about its content model, but you can't specify the actions to be taken without modifying the code. (The same is technically true of the lexical scanner, but the list of existing actions is pretty generic.) In PyXup, you can specify actions to take, some of which take arguments which can be either constants or parts of the input stream matched by the associated pattern. This is a huge win, and I wish I'd thought of it when designing TagSoup. The question period involved both Henry and me giving answers to lots of the questions.

To wrap up we had, as we always have (it's a tradition), a talk by Michael Sperberg-McQueen. These talks are not recorded, nor are any notes or slides published, still less a full paper. You just hafta be there to appreciate the full (comedic and serious) impact. The title of this one was "Topic maps, RDF, and mushroom lasagne"; the relevance of the last item was that if you provide two lasagnas labeled "Meat" and "Vegetarian", most people avoid the latter, but if it's labeled "Mushroom" (or some other non-privative word) instead, it tends to get eaten about as much as the meat item. The talk was, if I have to nutshell it, about the importance of doing things rather than fighting about how to do them. At least that was my take-away; probably everyone in the room had a different view of the real meaning.

That's all, folks. Hope to see you in Montreal next August at Extreme Markup 2008.

2007-08-13

Extreme Markup 2007: Thursday

This is a report on Extreme Markup 2007 for Thursday.

My first talk of the day was on a fast XSLT processor written in C++ and designed for large documents (more than 2 gigs) and high performance. Documents are loaded into memory using event records, which are essentially just reifications of SAX events, with parent and next-sibling links. Because internal links are record numbers rather than raw pointers, the internal limit is 2 GB of events rather than bytes (I think that's the reason). The authors did an interesting experiment with compiling multiple XPaths into a single DFA and executing them in parallel, but found that overall performance was not really improved by this additional complexity.
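
As I understood them, the event records amount to something like this (a guess at the design, not the authors' actual code):

    // One record per SAX event; links are record numbers, not pointers,
    // so the 2G limit counts events rather than bytes.
    final class EventRecord {
        byte kind;        // START_ELEMENT, END_ELEMENT, CHARACTERS, ...
        int name;         // index into a shared table of element/attribute names
        int parent;       // record number of the parent event, or -1
        int nextSibling;  // record number of the next sibling, or -1
        int textOffset;   // offset into a shared character buffer (CHARACTERS only)
    }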

The next presentation was by Eric Freese, about chatbots and RDF. The chatbot here uses AIML, a language for representing chat scripts, and enhances it by allowing RDF triples (notably Dublin Core and FOAF information) to be stuffed into the bot so it knows a lot more about a lot more. The bot, called ALICE, follows the classical Eliza/Doctor tradition; it applies rewrite rules to the input to produce its output. (Oddly, Eric had believed that Eliza was hardcoded rather than script-driven until I pointed out otherwise during the comment period; to watch the Doctor script in action, fire up emacs and type "Alt-x doctor", or type "Alt-x psychoanalyze-pinhead" and hit Ctrl-G after a few seconds.)

James David Mason told us all about the application of topic maps to very complex systems, in this case pipe organs. Organs are even more complicated than they appear to be from the outside. (The reviewers, we were told, tended to say things like "I wouldn't attend this talk because I'm not interested in organs"; I think they missed the point.) There were, however, wonderful multimedia interruptions of the talk in the form of excerpts from music being played on the organs under discussion. Usually James tells us censored versions of stories from Y-12, the secret national laboratory at Oak Ridge, Tennessee. (They're starting to make things with FOGBANK again after many years of not using it, and it's a big concern.) This time he talked about his non-work instead of non-talking about his work; it was great.

Next I heard some folks from Portugal talking about digital preservation of the Portuguese national and local archives. They use PDF/A to preserve documents of all sorts, but they also need to preserve the contents of relational databases, with tables, views, triggers, stored procedures, and all. Dumping them as SQL has serious portability problems, so they decided to carefully define a database markup language to handle all the metadata as well as the data itself. That way, information can be authentically transferred from one database engine to another without loss.

The last (and plenary) presentation of the day had not a speaker but "listeners" from W3C, OASIS, and ISO, asking what things should be standardized in addition to what we already have. Audience members concentrated on processes for stabilizing and withdrawing standards (ISO has this, the other two don't), the desire for more and earlier feedback processes and fewer faits accomplis, and other meta-issues. There were a few requests for additional concrete standards; unfortunately, I don't have notes. The session ended without too many real fights.

On Wednesday I talked about the nocturne on naming, but it was actually held on Thursday. Much of it was spent discussing the FRBR model of works, expressions, manifestations, and items. For computer-document purposes, a work is the abstract work, e.g. Hamlet; an expression is a particular realization of it, like a particular edition of the text or a recording of a performance; a manifestation is the expression in a particular format such as .txt, .html, .doc, or .avi; and an item is a particular copy residing on a particular server, disk, or tape. Ideally, there should be separate URIs for each of these things.

A commenter asked what it was about topic maps that I was defending: primarily the clear topic-map distinction between a URI used to name a document (a subject locator in topic maps jargon) and the same URI used to refer to the subject matter of the document (a subject indicator). Theoretically, RDF uses different URIs for these separate purposes, but that leads to paradoxes -- the document you get when you GET a subject indicator, what's the URI that identifies that? Furthermore, how do you tell which is which? In the topic-maps data model, a topic simply has some URIs which are locators and others which are indicators.

Extreme Markup 2007: Wednesday

This is a report on Extreme Markup 2007 for Wednesday.

The first talk of the morning was by David Dubin, and was about an alternative approach to reifying RDF statements so that one can make RDF claims about existing RDF statements (such as who made them, and where and when, and whether and how much you should believe them). The classical approach is to model the statement as a node, and then assert Subject, Predicate, and Object properties about this node. David's approach involves using the RDF/XML for the claim itself as the object of RDF claims. I can't say I understood what he was driving at very well: it seems to me that the main deficiency with RDF that makes reification necessary is that you can't state an RDF sentence without also claiming that it is true. This is convenient in simple cases, but annoying when you want to do meta-RDF.

Paolo Marinelli analyzed alternative approaches to streaming validation. W3C XML Schema provides what he calls a STEVE streaming discipline: at the Start tag, you know the Type of an element, and at the End tag, you can make a Validity Evaluation. XML Schema 1.1 proposes to provide various kinds of conditional validation using a subset of XPath 1.0, but does not (in the current draft) provide the full power of what is actually streamable in XPath.

The core of this paper is the classification of XPath expressions into various axes and operations, specifying when you can determine the value of the expression and at what memory cost ("constant" and "linear-depth" mean streamable, "linear-size" means not streamable). See Table 1 in the paper for the details.

Finally, Paolo proposes an extended streamability model called LATVIA, in which there are no restrictions on such XPaths, and schemas are marked streamable or non-streamable by their authors (or their authors' tools, more likely). The difficulty here is that implementors' tools that depend on knowing element types early will wind up being unable to process certain schemas, which will result in a process of negotiation: "Can't you make this streamable?" "Well, no, because ..." "That will cost you one billion stars."

Moody Altamimi gave a superb talk on normalizing mathematical formulae. There are two flavors of MathML, Presentation and Content; the former is about layout (it's close to LaTeX conceptually), the latter is meant to express meaning, and is closer to Lisp. Even in Content MathML, there will be lots of false negatives because of non-significant differences in the way authors express things: it's all one whether you say an integration is from m to n, or whether it's over the interval [m,n], but Content MathML uses two different notations for this. Similarly, we can presume that + represents an associative operation, and do some transforms to unify (+ a b c), (+ (+ a b) c), and (+ a (+ b c)). On the other hand, we don't want to go overboard and unify (+ a 0) with a; if the author wrote "a + 0", presumably there was a good reason.
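
To make the + example concrete, here are the two Content MathML spellings of a + b + c -- (+ a b c) and (+ (+ a b) c) -- that such a normalizer would unify:

    <apply><plus/><ci>a</ci><ci>b</ci><ci>c</ci></apply>

    <apply>
      <plus/>
      <apply><plus/><ci>a</ci><ci>b</ci></apply>
      <ci>c</ci>
    </apply>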

The next talk I attended was about hierarchical processing in SQL database engines, and was the worst presentation of the conference: it was a marketing presentation rather than a technical one ("their products bad, our product good"), and furthermore it was a marketing presentation that was all technical details, and as such exceedingly boring. Furthermore, it assumed a detailed ANSI SQL background rather than an XML one. I'll read the paper, because I'm interested in the subject, but I'm not very hopeful.

Liam Quin of the W3C told us all about XQuery implementations, and which ones support what and how well, and what kinds of reasons you'd have for choosing one over the other, all without making any specific recommendations himself. He said that in general conformance was good across the major implementations, and performance depended too heavily on the specifics of the query to generalize.

At this point I got pretty burned out and skipped all the remaining regular talks for the day, except for Ann Wrightson's excellent overview of model-driven XML, why the instance documents are practically unreadable (too high a level of meta-ness, basically), and what can be done (not much, short of defining local views that map into the fully general modeled versions). I spent a fair amount of time in the poster room, though I didn't take notes. I also attended the nocturne that evening on URIs and names, where I defended the Topic Maps view of things with vim, vigor, and vitality.