Down vs. across

This turn-of-the-eighteenth-century poem reads one way down, another way across. The "down" version was politically orthodox back in the reign of George I, whereas the "across" version represented treasonous Jacobite sympathies.

I love with all my heart          The Tory party here
The Hanoverian part               Most hateful doth appear
And for their settlement          I ever have denied
My conscience gives consent       To be on James's side
Most glorious is the cause        To be with such a king
To fight for George's laws        Will Britain's ruin bring
This is my mind and heart         In this opinion I
Though none should take my part   Resolve to live and die


Third normal form for classes

It's been wisely and wittily said, though I don't know who by, that a relation is in third normal form (3NF) when all its fields depend on "the key, the whole key, and nothing but the key". This is generally considered to be a Good Thing, though people do deviate from it for the sake of performance (or what they think performance will be -- but that's a whole different rant).

I'd like to introduce an analogous notion of 3NF for classes in object-oriented programming. A class is in 3NF when its public methods depend on the state, the whole state, and nothing but the state. By state here I mean all the private instance variables of the class, without regard to whether they are mutable or not. A public method depends on the state if it either directly refers to an instance variable, or else invokes a private method that depends on the state.

So what do I mean, "the state, the whole state, and nothing but the state"? Three things:

  • If a method doesn't depend on the state at all, it shouldn't be a method of the class. It should be placed in a utility class, or (in C++) outside the class altogether, or at the very least marked as a utility method. It's really a customer of the class, and leaving it out of the class improves encapsulation. (Scott Meyers makes this point in Item 23 of Effective C++, Third Edition.)

  • Furthermore, if the state can be partitioned into two non-overlapping sub-states such that no methods depend on both of them, then the class should be refactored into two classes with separate states. This also improves encapsulation, as the methods in one class can now be changed without regard to the internals of the other class.

  • Finally, if the behavior of a method depends on something outside the state, encapsulation is once again broken — from the other direction this time. Such a method is difficult to test, since you cannot know what parts of what classes it depends on except by close examination.
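Here's how the first two rules might look in code -- a minimal Python sketch, with all class and method names invented for illustration:

```python
class Report:
    """Not in 3NF: one method ignores the state entirely, and the
    remaining methods split the state into two disjoint halves."""
    def __init__(self, title, rows):
        self._title = title   # used only by heading()
        self._rows = rows     # used only by total()

    def heading(self):
        return self._title.upper()

    def total(self):
        return sum(self._rows)

    def format_money(self, cents):
        # Depends on no instance state at all: really a customer of
        # the class, per the first rule.
        return "$%.2f" % (cents / 100)


# After refactoring: the stateless helper moves out of the class, and
# the two disjoint sub-states become two smaller classes, each in 3NF.
def format_money(cents):
    return "$%.2f" % (cents / 100)

class Heading:
    def __init__(self, title):
        self._title = title
    def text(self):
        return self._title.upper()

class SalesTotal:
    def __init__(self, rows):
        self._rows = rows
    def total(self):
        return sum(self._rows)
```

Customers that only need money formatting now depend on a free function, and Heading and SalesTotal can each change their internals without regard to the other.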

At any rate, this is my current understanding. My Celebes Kalossi model grew out of considering how methods and state belong together, and this is the practical fruit of it.

Update: I didn't talk about protected methods, protected instance variables, or subclassing. The subclasses of a class are different from its customers, and need to be considered separately if any protected methods or state exist. I am a firm believer in "design for subclassing or forbid it": if you follow the rules above, then instead of subclassing a class, you can simply replace it with a work-alike that has different state while taking no risks of breaking it. (You probably need to make the original class implement some interface.)

Furthermore, the static methods of a class have static state, and the same analysis needs to be performed with respect to them.



Extreme Markup 2007: Friday

This is a report on Extreme Markup 2007 for Friday.

Friday was a half-day, but Extreme 2007 saved the best of all its many excellent papers for almost the last. Why is it that we repeat the "content over presentation" mantra so incessantly, but throw up our hands when it comes to tables? All the standard table models -- HTML, TEI, CALS -- are entirely presentational: a table is a sequence of rows, each of which is a sequence of cells. There are ways to give each column a label, but row labels are just the leftmost cell, perhaps identified presentationally but certainly not in a semantic way. If you have a table of sales by region in various years, pulling out North American sales for 2005 with XPath is no trivial matter. Why must this be? Inquiring minds (notably David Birnbaum's) want to know.

David's proposal, really more of a meta-proposal, is to use semantic markup throughout. Mark up the table using an element expressing the sort of data, perhaps salesdata. Provide child elements naming the regions and years, and grandchildren showing the legal values of each. Then each cell in the table is simply a sales element whose value is the sales data, and whose attributes signify the region and year that it is data for. That's easy to query, and not particularly hard to XSLT into HTML either. (First use of the verb to XSLT? Probably not.)
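A sketch of what such markup and the query might look like, using Python's ElementTree (the element and attribute names salesdata, sales, region, and year are my guesses at the sort of thing David means):

```python
import xml.etree.ElementTree as ET

# Each cell is just a sales element; its attributes say which region
# and year it is data for.
doc = ET.fromstring("""
<salesdata>
  <sales region="NA" year="2005">1200</sales>
  <sales region="NA" year="2006">1350</sales>
  <sales region="EU" year="2005">900</sales>
</salesdata>
""")

# Pulling out North American sales for 2005 is now a one-line query.
cell = doc.find("sales[@region='NA'][@year='2005']")
print(cell.text)
```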

Of course, there is no reason to stop at two categories. You can have a cube, or a hypercube, or as many dimensions as you want. The OLAP community knows all about the n-cubes you get from data mining, and the Lotus Improv tradition, now proudly carried on at quantrix.com (insert plug from highly satisfied customer here), has always been about presenting such n-cubes cleverly as spreadsheets.

The conference proper ended in plenary session with work by Henry Thompson of W3C that extends my own TagSoup project, which provides a stream of SAX events based on nasty, ugly HTML, to have a different back end. TagSoup is divided into a lexical scanner for HTML, which uses a state machine driven by a private markup language, and a rectifier, which is responsible for outputting the right SAX events in the right order. It's a constraint on TagSoup that although it can retain tags, it always pushes character content through right away, so it is streaming (modulo the possibility of a very large start-tag).

Henry's PyXup replaces the TagSoup rectifier, which uses another small markup language specifying the characteristics of elements, with his own rectifier using a pattern-action language. In TagSoup, you can say what kind of element an element is and something about its content model, but you can't specify the actions to be taken without modifying the code. (The same is technically true of the lexical scanner, but the list of existing actions is pretty generic.) In PyXup, you can specify actions to take, some of which take arguments which can be either constants or parts of the input stream matched by the associated pattern. This is a huge win, and I wish I'd thought of it when designing TagSoup. The question period involved both Henry and me giving answers to lots of the questions.
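Something like the following toy pattern-action dispatch, in Python rather than PyXup's actual notation, may convey the flavor (the rules and action names are invented):

```python
import re

# Each rule pairs a pattern with an action name and an argument; the
# argument may be a constant or a part of the input matched by the
# pattern. Earlier rules win over later ones.
rules = [
    (re.compile(r"<b>"), "emit_start", "bold"),     # constant argument
    (re.compile(r"<(\w+)>"), "emit_start", r"\1"),  # captured argument
]

def dispatch(token):
    for pattern, action, arg in rules:
        m = pattern.fullmatch(token)
        if m:
            return action, m.expand(arg)
    return "emit_chars", token  # default: pass text through

print(dispatch("<b>"))   # the specific rule fires, with its constant
print(dispatch("<em>"))  # the generic rule fires, with a captured part
```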

To wrap up we had, as we always have (it's a tradition) a talk by Michael Sperberg-McQueen. These talks are not recorded, nor are any notes or slides published, still less a full paper. You just hafta be there to appreciate the full (comedic and serious) impact. The title of this one was "Topic maps, RDF, and mushroom lasagne"; the relevance of the last item was that if you provide two lasagnas labeled "Meat" and "Vegetarian", most people avoid the latter, but if it's labeled "Mushroom" (or some other non-privative word) instead, it tends to get eaten about as much as the meat item. The talk was, if I have to nutshell it, about the importance of doing things rather than fighting about how to do them. At least that was my take-away; probably everyone in the room had a different view of the real meaning.

That's all, folks. Hope to see you in Montreal next August at Extreme Markup 2008.


Extreme Markup 2007: Thursday

This is a report on Extreme Markup 2007 for Thursday.

My first talk of the day was on a fast XSLT processor written in C++ and designed for large documents (more than 2 gigs) and high performance. Documents are loaded into memory using event records, which are essentially just reifications of SAX events, with parent and next-sibling links. Because internal links are record numbers rather than raw pointers, the internal limit is 2 GB of events rather than bytes (I think that's the reason). The authors did an interesting experiment with compiling multiple XPaths into a single DFA and executing them in parallel, but found that overall performance was not really improved by this additional complexity.
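As I understand it, the event records are something like this sketch (the field names are my guesses, not the authors' actual layout):

```python
from dataclasses import dataclass

# Each SAX event is reified as a fixed record, and links are record
# numbers (array indices) rather than raw pointers -- which is why the
# 2G limit counts events, not bytes.
@dataclass
class EventRecord:
    kind: str          # "start", "chars", or "end"
    data: str          # element name or character content
    parent: int        # record number of the parent, -1 at the root
    next_sibling: int  # record number of the next sibling, -1 if none

# <doc><a/><b/></doc> as an event stream with links:
records = [
    EventRecord("start", "doc", -1, -1),  # record 0
    EventRecord("start", "a", 0, 2),      # record 1
    EventRecord("start", "b", 0, -1),     # record 2
]

# Navigation is array indexing, not pointer chasing:
assert records[records[1].next_sibling].data == "b"
```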

The next presentation was by Eric Freese, about chatbots and RDF. The chatbot here uses AIML, a language for representing chat scripts, and enhances it by allowing RDF triples (notably Dublin Core and FOAF information) to be stuffed into the bot so it knows a lot more about a lot more. The bot, called ALICE, follows the classical Eliza/Doctor tradition; it applies rewrite rules to the input to produce its output. (Oddly, Eric had believed that Eliza was hardcoded rather than script-driven until I pointed out otherwise during the comment period; to watch the Doctor script in action, fire up emacs and type "Alt-x doctor", or type "Alt-x psychoanalyze-pinhead" and hit Ctrl-G after a few seconds.)

James David Mason told us all about the application of topic maps to very complex systems, in this case pipe organs. Organs are even more complicated than they appear to be from the outside. (The reviewers, we were told, tended to say things like "I wouldn't attend this talk because I'm not interested in organs"; I think they missed the point.) There were, however, wonderful multimedia interruptions of the talk in the form of excerpts from music being played on the organs under discussion. Usually James tells us censored versions of stories from Y-12, the secret national laboratory at Oak Ridge, Tennessee. (They're starting to make things with FOGBANK again after many years of not using it, and it's a big concern.) This time he talked about his non-work instead of non-talking about his work; it was great.

Next I heard some folks from Portugal talking about digital preservation of the Portuguese national and local archives. They use PDF/A to preserve documents of all sorts, but they also need to preserve the contents of relational databases, with tables, views, triggers, stored procedures, and all. Dumping them as SQL has serious portability problems, so they decided to carefully define a database markup language to handle all the metadata as well as the data itself. That way, information can be authentically transferred from one database engine to another without loss.

The last (and plenary) presentation of the day had not a speaker but "listeners" from W3C, OASIS, and ISO, asking what things should be standardized in addition to what we already have. Audience members concentrated on processes for stabilizing and withdrawing standards (ISO has this, the other two don't), the desire for more and earlier feedback processes and fewer faits accomplis, and other meta-issues. There were a few requests for additional concrete standards; unfortunately, I don't have notes. The session ended without too many real fights.

On Wednesday I talked about the nocturne on naming, but it was actually held on Thursday. Much of it was spent discussing the FRBR model of works, expressions, manifestations, and items. For computer-document purposes, a work is the abstract work, e.g. Hamlet; an expression is a particular kind of realization, like a particular edition of the text or recording of a performance; a manifestation is the expression in a particular format such as .txt, .html, .doc, or .avi; and an item is a particular copy residing on a particular server, disk, or tape. Ideally, there should be separate URIs for each of these things.

A commenter asked what it was about topic maps that I was defending: primarily the clear topic-map distinction between a URI used to name a document (a subject locator in topic maps jargon) and the same URI used to refer to the subject matter of the document (a subject indicator). Theoretically, RDF uses different URIs for these separate purposes, but that leads to paradoxes -- the document you get when you GET a subject indicator, what's the URI that identifies that? Furthermore, how do you tell which is which? In the topic-maps data model, a topic simply has some URIs which are locators and others which are indicators.

Extreme Markup 2007: Wednesday

This is a report on Extreme Markup 2007 for Wednesday.

The first talk of the morning was by David Dubin, and was about an alternative approach to reifying RDF statements so that one can make RDF claims about existing RDF statements (such as who made them, and where and when, and whether and how much you should believe them). The classical approach is to model the statement as a node, and then assert Subject, Predicate, and Object properties about this node. David's approach involves using the RDF/XML for the claim itself as the object of RDF claims. I can't say I understood what he was driving at very well: it seems to me that the main deficiency with RDF that makes reification necessary is that you can't state an RDF sentence without also claiming that it is true. This is convenient in simple cases, but annoying when you want to do meta-RDF.

Paolo Marinelli analyzed alternative approaches to streaming validation. W3C XML Schema provides what he calls a STEVE streaming discipline: at the Start tag, you know the Type of an element, and at the End tag, you can make a Validity Evaluation. XML Schema 1.1 proposes to provide various kinds of conditional validation using a subset of XPath 1.0, but does not (in the current draft) provide the full power of what is actually streamable in XPath.

The core of this paper is the classification of XPath expressions into various axes and operations, specifying when you can determine the value of the expression and at what memory cost ("constant" and "linear-depth" mean streamable, "linear-size" means not streamable). See Table 1 in the paper for the details.

Finally, Paolo proposes an extended streamability model called LATVIA, in which there are no restrictions on such XPaths, and schemas are marked streamable or non-streamable by their authors (or their authors' tools, more likely). The difficulty here is that implementors' tools that depend on knowing element types early will wind up being unable to process certain schemas, which will result in a process of negotiation: "Can't you make this streamable?" "Well, no, because ..." "That will cost you one billion stars."

Moody Altamimi gave a superb talk on normalizing mathematical formulae. There are two flavors of MathML, Presentation and Content; the former is about layout (it's close to LaTeX conceptually), the latter is meant to express meaning, and is closer to Lisp. Even in Content MathML, there will be lots of false negatives because of non-significant differences in the way authors express things: it's all one whether you say an integration is from m to n, or whether it's over the interval [m,n], but Content MathML uses two different notations for this. Similarly, we can presume that + represents an associative operation, and do some transforms to unify (+ a b c), (+ (+ a b) c), and (+ a (+ b c)). On the other hand, we don't want to go overboard and unify (+ a 0) with a; if the author wrote "a + 0", presumably there was a good reason.
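The associativity unification might be sketched like this in Python, with tuples standing in for Content MathML apply elements:

```python
# Flatten nested applications of + into a single n-ary application, so
# that all three spellings of a + b + c normalize to the same form.
def flatten_plus(expr):
    if not isinstance(expr, tuple) or expr[0] != "+":
        return expr
    args = []
    for a in expr[1:]:
        a = flatten_plus(a)
        if isinstance(a, tuple) and a[0] == "+":
            args.extend(a[1:])   # splice a nested + into the parent
        else:
            args.append(a)
    return ("+",) + tuple(args)

assert flatten_plus(("+", ("+", "a", "b"), "c")) == ("+", "a", "b", "c")
assert flatten_plus(("+", "a", ("+", "b", "c"))) == ("+", "a", "b", "c")
# But (+ a 0) is deliberately left alone, per the talk:
assert flatten_plus(("+", "a", 0)) == ("+", "a", 0)
```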

The next talk I attended was about hierarchical processing in SQL database engines, and was the worst presentation of the conference: it was a marketing presentation rather than a technical one ("their products bad, our product good"), and furthermore it was a marketing presentation that was all technical details, and as such exceedingly boring. Furthermore, it assumed a detailed ANSI SQL background rather than an XML one. I'll read the paper, because I'm interested in the subject, but I'm not very hopeful.

Liam Quin of the W3C told us all about XQuery implementations, and which ones support what and how well, and what kinds of reasons you'd have for choosing one over the other, all without making any specific recommendations himself. He said that in general conformance was good across the major implementations, and performance depended too heavily on the specifics of the query to generalize.

At this point I got pretty burned out and skipped all the remaining regular talks for the day, except for Ann Wrightson's excellent overview of model-driven XML, why the instance documents are practically unreadable (too high a level of meta-ness, basically), and what can be done (not much, short of defining local views that map into the fully general modeled versions). I spent a fair amount of time in the poster room, though I didn't take notes. I also attended the nocturne that evening on URIs and names, where I defended the Topic Maps view of things with vim, vigor, and vitality.

Extreme Markup 2007: Tuesday

This is a report on Extreme Markup 2007 for Tuesday.

Tuesday started, as is traditional, with a welcome from James David Mason, Steve Newcomb, and Michael Sperberg-McQueen, all luminaries of the XML community, followed by a brief practical introduction by Debbie LaPeyre (where are the bathrooms? where is the coffee? who can help you?) and the keynote speech by Tommie Usdin. Debbie and Tommie have been running this conference for years, and are now doing so directly through their company, Mulberry Technologies, rather than via IDEAlliance -- a parting of the ways with clear benefits for both sides, I believe.

Tommie's talk, "Riding the wave, riding for a fall, or just along for the ride" was about a lot of things, but what drew my attention was the question of edge cases. 95% of all XML, we are told, is not written by human beings but generated as part of Web Services calls. And then again, depending on who you ask, 95% of XML is RSS feeds. We can expect the list of 95%s to grow in future. Or perhaps 95% is really human-authored content after all. Who knows?

Next, still in plenary session, Tom Passin presented on the practical use of RDF for system modeling. The interesting feature here for me was an indented-plain-text representation of RDF that can be rewritten by a simple Python program as RDF/XML. It takes advantage of the fact that people naturally write lists, naturally indent the subpoints (which can be modeled as RDF properties), and can easily extend the indentation to add values. Here's an example from his paper:

   rdfs:label::Office for Federated Systems
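My reconstruction of the idea, not Tom's actual program, goes something like this (the subject name ex:MyOffice is invented):

```python
# Unindented lines are subjects; indented "property::value" lines are
# properties of the most recent subject.
def to_triples(text):
    triples, subject = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line == line.lstrip():          # no indentation: a subject
            subject = line.strip()
        else:                              # indented: property::value
            prop, _, value = line.strip().partition("::")
            triples.append((subject, prop, value))
    return triples

outline = """ex:MyOffice
   rdfs:label::Office for Federated Systems"""

print(to_triples(outline))
```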

The final plenary session was Michael Kay, the author of the excellent XSLT processor Saxon, on optimizing XSLT using an XSLT transformation. XSLT is written as a tree, and it's also an excellent language for transforming trees, and optimization is mostly about transforming one tree into another. For example, an expression of the form count(X) = 0 in XQuery can be rewritten as empty(X); it's a lot faster if you don't have to count thousands of Xs just to determine whether there are any. (Saxon treats XQuery as just another syntax for XSLT 2.0.)

Michael's approach involves having lots of simple rewrite rules of this type, and then iterating over the list until no more changes are being made. That's important, because one optimization may expose the possibility of another optimization being made. Sometimes it pays to recognize a general XSLT expression as a special case, and provide a processor-specific extension that provides that special case efficiently. For this reason, the rewrites are as a whole processor-specific. Michael talked in particular about the case of having lots of boolean-valued attributes on an element whose content model is known. Rewriting can cleverly pack those attributes into a single integer-valued attribute with special functions to test bit combinations.
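The iterate-until-fixpoint shape of it can be sketched in a few lines of Python; the tuple trees and the single rule are my own toy notation, not Saxon internals:

```python
# Apply a list of rewrite rules repeatedly until none of them fires.
def rewrite(expr, rules):
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(expr)
            if new is not None and new != expr:
                expr, changed = new, True
    return expr

def count_eq_zero(expr):
    # count(X) = 0  =>  empty(X)
    if (isinstance(expr, tuple) and expr[0] == "="
            and isinstance(expr[1], tuple) and expr[1][0] == "count"
            and expr[2] == 0):
        return ("empty", expr[1][1])
    return None

print(rewrite(("=", ("count", "sales"), 0), [count_eq_zero]))
```

A real rewriter would also recurse into subexpressions; this toy version only rewrites the top of the tree.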

Update: Michael also pointed out that XSLT optimizers don't know anything about the data, unlike the built-in SQL query optimizers that databases use, and so have to make guesses about how much of what you're going to find in any input document. Someone asked if he had made any explorations in that direction: he said not as yet.

In the question period, Michael said that Saxon is moving more and more to generating JVM/CLR bytecodes rather than interpreting the decorated tree at runtime. Some cases are still difficult to make fast in generated code: for example, given the path $x/a/b/c/d/e, if $x is known to be a singleton, the results will already be in document order, but if not, a sort must be performed. This could obviously be compiled into conditional code, but it's messy and can lead to a combinatorial explosion as it interacts with other features. Deciding at run time is just easier.

Most of the remaining papers were given two at a time, although recognizable tracks don't really exist at Extreme. I went to see David Lee of Epocrates on getting content authored in MS Word into appropriate XML. The core of this talk was an extended lament on how authors insist on using Word; even if you provide specialized authoring tools, they compose in Word and then cut and paste, more or less incorrectly, into the specialized tool. Epocrates has tried a variety of strategies: Word styles (authors won't use them), tagged sections (authors screw them up), form fields (plaintext only, so authors delete them and type in rich text instead). In the end, they adopted Word tables as the safest and least corruptible approach. A few Word macros provide useful validations, and when the document is complete, a Word 2003 macro rewrites it using Word 2003 XML (unless it is already in that format). I pointed out that the approach of having authors use Word and saving in plain text was also viable, leaving all markup to be added by automated downstream processing; David said that design was too simple for the complex documents his authors were creating.

Next I heard Michael Sperberg-McQueen on his particular flavor of overlap; I have nothing in particular to add to his paper, which is extremely accessible.

As a RELAX NG geek, I was interested to hear about Relaxed, a completely new implementation of RELAX NG and Schematron validation. Update: I forgot to mention that it also provides NVDL (which accounts for the name), a meta-schema language which dissects a document by namespace, local root element, or both, and passes different fragments of the document to different validators using different schemas. JNVDL uses the Sun Multi-Schema Validator to translate any other schema language you can think of into RELAX NG. You can try the validator online.

I should mention that this conference deliberately provides long breaks, a long lunch, and encourages in-the-halls discussion, one of the most important parts (some say the only important part) of any conference. So if I didn't attend the maximum possible number of talks, there were reasons for that. (Sometimes you just need to catch up on your email, too.)

Extreme Markup 2007: Monday

This is a report on Extreme Markup 2007 for Monday.

Monday was not technically part of Extreme at all; it was a separate workshop called "International Workshop on Markup of Overlapping Structures 2007". Overlap is a special interest of mine: how to deal with marking up things like books, where you may want to preserve both the chapter/paragraph/sentence structure and chapter/page structure (or even just page structure, in books where the chapter boundary doesn't mean the beginning of a page). Bibles can add chapter/verse structure, and indeed one of the presenters was formerly with the Society of Biblical Literature.

I won't give details of the presentations: you will eventually be able to see them on the Extreme Markup site, and googling for LMNL, Trojan Horse markup, TexMecs, Goddag structures, and rabbit/duck grammars should provide more than enough information. I will, however, say a bit about how overlap is typically represented in XML, and then about the two different kinds of (or use cases for) overlap that this workshop identified.

A variety of techniques have been used to represent overlap, but work in the field is generally converging on using XML augmented with milestone elements. A milestone is an empty element that represents the start-tag or end-tag of a virtual element that can overlap with other virtual elements or the actual elements of the document. A milestone is tagged with an attribute to show whether it is meant to be seen as a start-tag or an end-tag, and how the start- and end-tags match up. So <foo sID="xyz"/> would indicate the beginning of a logical element, and <foo eID="xyz"/> would indicate the end. The xyz has no particular significance, but must be present in exactly one start milestone and one end milestone. This technique has the advantage that the virtual element's name is an element name, potentially belonging to a namespace. Clever XSLT processing can be used to convert real elements into virtual ones or (provided no overlap of real elements results) vice versa.
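A toy check of the sID/eID pairing rule, using Python's ElementTree (the element names here are invented):

```python
import xml.etree.ElementTree as ET

# The virtual <s> element overlaps the real <page> boundary.
doc = ET.fromstring(
    '<doc>'
    '<page n="1">One two <s sID="xyz"/>three</page>'
    '<page n="2">four<s eID="xyz"/> five</page>'
    '</doc>')

open_ids, virtual = {}, []
for elem in doc.iter():          # document order
    if elem.get("sID") is not None:
        open_ids[elem.get("sID")] = elem.tag
    elif elem.get("eID") is not None:
        i = elem.get("eID")
        # Each eID must match exactly one earlier sID on the same name.
        assert open_ids.pop(i) == elem.tag
        virtual.append((elem.tag, i))

assert not open_ids              # every start milestone was closed
print(virtual)
```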

Overlap of the first kind descends from the old SGML CONCUR feature, and is basically about having multiple hierarchies over a single document, like the examples I gave above. Typically you want to be able to pull out each individual hierarchy as a well-formed XML expression without milestones, and then look at it or validate it or process it however you like. Often this is all that's needed for overlapping documents, and it's then a matter of deciding whether all the elements are to be milestones in the base form of the document, or whether some virtual elements (the most important ones) are left as real XML elements rather than milestones.

Overlap of the second kind cares about ranges rather than elements, interval arithmetic (and computational geometry) rather than trees and DAGs, inclusion rather than dominance. It's the kind that most interests me, and LMNL is in my highly biased opinion the state of the art here. If you want dominance information, you can mark it up using annotations (LMNL's broadened version of attributes), or just infer it from what ranges include what, but you aren't assumed to care about it: basic LMNL just has a simple list of ranges over text.
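In toy form, "a simple list of ranges over text" amounts to something like this (the offsets and names are invented, and this is not real LMNL syntax):

```python
# Each range is a name plus start/end offsets into the text; dominance,
# when you want it at all, is inferred from inclusion.
text = "It was a dark and stormy night"
ranges = [
    ("sentence", 0, 30),
    ("phrase", 9, 24),   # "dark and stormy"
    ("word", 9, 13),     # "dark"
]

def includes(outer, inner):
    return outer[1] <= inner[1] and inner[2] <= outer[2]

assert text[9:13] == "dark"
assert includes(ranges[0], ranges[1]) and includes(ranges[1], ranges[2])
```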

From 2:00 PM onward there was a kind of Quaker meeting, with people going to the mike and talking for 5 minutes. Originally there were speakers and audience members asking questions, but very quickly the discussion became general, as signaled by the audience mike being moved to the front of the room. (When you are in this situation, always use the mike: people with hearing aids can't hear you no matter how loud you yell.) Nothing new emerged here, but participants got to understand one another's viewpoints and systems pretty well. Extreme probably will do this workshop again in a few years, and some other kind of workshop (XML data binding was suggested) next year.


At the end of the day ...

... it's time to wake up and smell the coffee.



I am one.

This is a personal blog. Nothing I say here represents the views of anyone else.


Lawrence of Arabia tells a tale

Sunset came down, delightfully red, and after the feast the whole party lay round the outside coffee-hearth lingering under the stars, while Auda and others told us stories. In a pause I remarked casually that I had looked for Mohammed el Dheilan in his tent that afternoon, to thank him for the milch camel he had given me, but had not found him. Auda shouted for joy, till everybody looked at him; and then, in the silence which fell that they might learn the joke, he pointed to Mohammed sitting dismally beside the coffee mortar, and said in his huge voice:--

'Ho! Shall I tell why Mohammed for fifteen days has not slept in his tent?' Everybody chuckled with delight, and conversation stopped; all the crowd stretched out on the ground, chins in hands, prepared to take the good points of the story which they had heard perhaps twenty times. The women, Auda's three wives, Zaal's wife, and some of Mohammed's, who had been cooking, came across, straddling their bellies in the billowy walk which came of carrying burdens on their heads, till they were near the partition-curtain; and there they listened like the rest while Auda told at length how Mohammed had bought publicly in the bazaar at Wejh a costly string of pearls, and had not given it to any of his wives, and so they were all at odds, except in their common rejection of him.

The story was, of course, a pure invention -- Auda's elvish humour heightened by the stimulus of Revolt -- and the luckless Mohammed, who had dragged through the fortnight guesting casually with one or other of the tribesmen, called upon God for mercy, and upon me for witness that Auda lied. I cleared my throat solemnly. Auda asked for silence, and begged me to confirm his words.

I began with the introducing phrase of a formal tale: In the name of God the merciful, the loving-kind. We were six in Wejh. There were Auda, and Mohammed, and Zaal, Gasim el Shimt, Mufaddhi and the poor man (myself); and one night just before dawn, Auda said, 'Let us make a raid against the market'. And we said, 'in the name of God'. And we went; Auda in a white robe and a red head-cloth, and Kasim sandals of pieced leather; Mohammed in a silken tunic of 'seven kings' and barefoot; Zaal ... I forget Zaal. Gasim wore cotton, and Mufaddhi was in silk of blue stripes with an embroidered head-cloth. Your servant was as your servant.'

My pause was still with astonishment. This was a close parody of Auda's epic style; and I mimicked also his wave of the hand, his round voice, and the rising and dropping tone which emphasized the points, or what he thought were points, of his pointless stories. The Howeitat [tribe] sat silent as death, twisting their full bodies inside their sweat-stiffened shirts for joy, and staring hungrily at Auda; for they all recognized the original, and parody was a new art to them and to him. The coffee man, Mufaddhi, a Shammar refugee from the guilt of blood, himself a character, forgot to pile fresh thorns on his fire for fixity of listening to the tale.

I told how we left the tents, with a list of the tents, and how we walked down towards the village, describing every camel and horse we saw, and all the passers-by, and the ridges, 'all bare of grazing, for by God that country was barren. And we marched: and after we had marched the time of a smoked cigarette, we heard something, and Auda stopped and said, 'Lads, I hear something'. And Mohammed stopped and said, 'Lads, I hear something'. And Zaal, 'By God, you are right'. And we stopped to listen, and there was nothing, and the poor man said, 'By God, I hear nothing'. And Zaal said, 'By God, I hear nothing'. And Mohammed said, 'By God, I hear nothing'. And Auda said, 'By God, you are right'.

'And we marched and we marched, and the land was barren, and we heard nothing. And on our right hand came a man, a negro, on a donkey. The donkey was grey, with black ears, and one black foot, and on its shoulder was a brand like this' (a scrabble in the air), 'and its tail moved and its legs: Auda saw it, and said, 'By God, a donkey'. And Mohammed said, 'By the very God, a donkey and a slave'. And we marched. And there was a ridge, not a great ridge, but a ridge as great as from the here to the what-do-you-call-it (hi biliyeh el hok) that is yonder: and we marched to the ridge and it was barren. That land is barren: barren: barren.

'And we marched: and beyond the what-do-you-call-it there was a what-there-is as far as hereby from thence, and thereafter a ridge: and we came to that ridge, and went up that ridge: it was barren, all that land was barren: and as we came up that ridge, and were by the head of that ridge, and came to the end of the head of that ridge, by God, by my God, by very God, the sun rose upon us.'

It ended the session. Everyone had heard that sunrise twenty times, in its immense bathos; an agony piled up of linked phrases, repeated and repeated with breathless excitement by Auda to carry over for hours the thrill of a raiding story in which nothing happened; and the trivial rest of it was exaggerated the degree which made it like one of Auda's tales; and yet, also, the history of the walk to market at Wejh which many of us had taken. The tribe was in waves of laughter on the ground.

Auda laughed the loudest and longest, for he loved a jest upon himself; and the fatuousness of my epic had shown him his own sure mastery of descriptive action. He embraced Mohammed, and confessed the invention of the necklace. In gratitude Mohammed invited the camp to breakfast with him in his regained tent on the morrow, an hour before we started for the swoop on Akaba. We should have a sucking camel-calf boiled in sour milk by his wives: famous cooks, and a legendary dish!


Essentialist Explanations, 13th edition

Thanks to the good folks at Metafilter, I have just updated my Essentialist Explanations page to its 13th edition, including (naturally) a new section entitled "Meta".


Summer haiku

How gracefully she
   inhaling the red roses
      goes down on her knees.


Family felines

(I know, it's Thursday, not Friday. So what.)

Katie sleeping

And what do you want?

Okay, I'm awake now. Can I help you?

It was not my catnip. I'm innocent, innocent I tell you....


A shell script mini-clinic

On one of the mailing lists I subscribe to, someone posted the following Unix shell script as a wrapper for a program which unfortunately littered the current directory with temporary files which it did not remove. (Because this post is not about the faults of that program, I've replaced the reference to it in the fourth line of the script).

#!/bin/bash
pushd . > /dev/null
cd /tmp
some-program $@
popd > /dev/null

The author added, "I'm sure there's a better way to write the script, but this would do the trick." So it does. However, the code exhibits a number of misunderstandings of how shell scripts work that I think are worth clarifying.

The pushd and popd lines are meant to preserve and restore the current working directory. However, a script (or any Unix process) always has its own working directory; because the script runs in a child process, changing the working directory in it does not affect the caller of the script in any way. (This is not true for shell startup scripts like .login, .profile, and .bashrc, which are sourced by the calling shell itself, or for Windows .bat and .cmd files, which run in the caller's own interpreter; all of those should be careful not to permanently change the current working directory.)

So the first improvement is to remove the pushd and popd lines entirely. Since they are the only parts that depend on the particular shell being run, namely bash, the script can now begin with #!/bin/sh instead of #!/bin/bash. On Linux and Cygwin this is a distinction without a difference, but on Solaris and the BSD Unixes, sh is a different shell from bash, and less of a resource hog; on older operating system versions, bash may not even be present. Therefore, it's best practice to write simple shell scripts like this one as portably as possible, running them with sh whenever you can. (Of course, there is no reason to avoid bash-specific features in full-fledged bash programs, where you are using bash as a programming language like Perl or Python.)

Lastly, using $@ to mean "all the arguments" is unsafe if any of the arguments might contain a space character. Instead, use "$@". For the same reason, $* should be avoided entirely, unless you want to reparse the arguments according to whatever whitespace they contain. The distinction doesn't happen to matter in this script, but it's a very good habit to get into in general, because some day you will have to process a file (possibly coming from Windows or the Mac) with a space in its name, and then your script will break embarrassingly.
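Putting the three fixes together, the wrapper boils down to this sketch (some-program is still the stand-in for the real command, and the `|| exit 1` guard is an extra safeguard not in the original post):

```shell
#!/bin/sh
# Run the program from /tmp so its temporary droppings land there,
# not in the caller's current directory. The cd affects only this
# script's own process, so no pushd/popd is needed.
cd /tmp || exit 1
# "$@" passes every argument through intact, even ones with spaces.
some-program "$@"
```

The script's exit status is now simply that of the program itself, which is usually what a wrapper's caller wants.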


Ava Cowan Foxy

Here is a condensed version of my web site, found in the Google cache of a server which is apparently down. I've neutered the links and forms, most of which referred to equally nonexistent resources.


Essentialist Explanations, 12th edition

I've finally caught up with the incoming entries, and published the 12th edition of my page Essentialist Explanations, a list of "simplistic and often humorous" (Langmaker) explanations of the form "Language X is essentially language Y under conditions Z." There are now 876 of them; new entries are always solicited, though it takes time for me to post them.

Read and enjoy.


Comments policy for Recycled Knowledge

I reserve the right to remove comments from this blog for any reason or no reason. I will do so primarily in the attempt to maintain a civil tone here. So far this hasn't been necessary, and I hope it never will be. Of course, I will also remove anything that might get me in legal trouble or that I think is spam: so far I've done the latter but not the former.


TagSoup 1.1 released

TagSoup 1.1 is now released at http://tagsoup.info. This release includes JAXP classes for those who want them. HTML comment handling has been completely revamped (the implementation had been broken for many releases), and a few other small bugs have been fixed. Do upgrade, if only for the comment handling.


Regularized Inglish: wordlists

Here are some Regularized Inglish wordlists from Wijk.
The following words (from a list of the 1000 most frequent words) have the stressed vowel changed:
abuv, afternoone, agen, agenst, aul, aulmoste, aulreddy, aulso, aultho, aulways, amung, anuther, anser, eny, enything, ask, baul, bair (verb), bayr (noun), beutiful, beuty, becoz, becum, beloe, blud, bloe, branch, breik, bruther, braught, bild, bilding, bilt, bisness, bisy, bye (for "buy"), caul, can't, chance, chainge, Shicago, class, cullour/cullor, cum, cumming, command, cumpany, controel, cood, coodn't, cuntry, corse, cort, cuver, dance, dainger, ded, deth, demand, discuver, doo, duz, duzn't, dun, doen't, dor, dubble, erly, erth, yther/eether, Ingland, Inglish, enuff, example, ie (for "eye"), faul, flor, floe (for "flow"), foar (for "four"), frend, frunt, fooll (for "full"), guvernment, grant, grass, greit, groope, groe, haf, haul, hed, helth, herd (for "heard"), hart (for "heart"), heven, hevy, hight, insted, iorn, jurnal, knoe, knoen, knolledge, laf, lern, Lundon, looze, luv, loe, loer, mashien, meny, mesure, munney, munth, muther, moove(ment), nyther/neether, nun (for "none"), nuthing, wunce, wun, oenly, uther, aught (for "ought"), oen, peeple, plesant, plesure, pritty, proove, pooll (for "pull"), poot (for "put"), quorter, red (for "read"), reddy, receev, remoove, roel, sault, sed, ses, shoo, shood (for "should"), shoelder, shoe (for "show"), smaul, snoe, sum(thing,times), sun (for "son"), soel, spred, strainge, tauk, taul, tair, thare, tharefore, tho, thaught, thru, tuch, tord (for "toward"), trubble, too (for "two"), waul, wont (for "want"), wor (for "war"), worm (for "warm"), wos, wosn't, wosh, Woshington, wotch, wauter, wair, wether, whot(ever), whare, hoo, hoel, hoome (for "whom"), hooze, wooman, wimmen, wunder(ful), woen't, wurd, wurk, wurld, wood (for "would"), woodn't, yoo, yung, yoor(self), yoo'r, yoothe.
And here are the words that have an unstressed vowel changed. Most unstressed vowel spellings are left alone in Regularized Inglish, but these are changed to avoid confusion.
capten, certen, certenly, mounten, forren, felloe, folloe, folloeing, narroe, tomorroe, windoe, yelloe.
From the same list, changes involving adding or removing a final e:
afternoone, childe, wilde, behinde, finde, kinde, minde, winde (verb), sine (for "sign"), moste, poaste, bothe, truthe, coole, foode, foole, moone, roofe, roome, scoole, soone, gon, Europ, ar, wer, figur, promis, purpos, minut, hav, giv, liv, nativ, believ, leav, serv, twelv, themselvs.
And finally, words with changes in consonants:
caracter, dout, gard, onnour/onnor, our (for "hour"), iland, lissen, ov, offen, scoole, shugar, shure, studdy, sugest, sunn, winn.
Here are the analogous lists for the next most frequent 1000 words:
accumpany, ahed, aincient, eny(body,wun,way), arrainge, baught, boe, boel, bred, brekfast, brest, breth, braud, bery, boosh, cam (for "calm"), cassle, chaimber, clark/clerk, curnel, cumfort(able), cupple, currage, cuzin, daingerous, discuvery, disese, duzen, ern, everywhare, exchainge, faulen, faulse, flud, foek, faught, foarth, frend(ly,ship), foolly (for "fully"), glance, gloe, guvernor, greitly, groen, groth, hunney, improove, jurney, kee, lether, luvly, luver, mashienery, magazien, ment, utherwise, oe, oener, poliece, poar, prayr, poosh, quolity, quontity, quorrel, rainge, recaul, recuver, ruff, rute, roe, scaircely, serch, seeze, shoen, sloe (for "slow"), sloely, sum(wum,whot,where), saught, sorce, suthern, steddy, strainger, thare'z, thred, thretten, thruout, throe (for "throw"), throen, tung, tresure, unknoen, wonder (for "wander"), worn (for "warn"), welth, woolf, wunn (for "won"), wurker, wurry, wurse, wurship, wurst, wurthy, woonde, yoo'd, yoo'll, yoo'v.
automobiel, curten, equol, welcum, holloe, shaddoe, sorroe.
blinde, kindely, desine, worne, noone, shoote, smoothe, troope, handsom, determin, engin, examin, imagin, purchase, favourit/favorit, immediat(ly), opposit, privat, senat, separat (adj.), havn't, activ, representativ, observ, ourselvs, preserv, reserv.
ashure, clime, Crist(ian,mas), det, onnest, leag, Lincon, sacrifise, summ, shurely, sord, Tomas, whissle.
These lists probably contain minor errors of transcription.

Regularized Inglish: theory

I see I promised to post on Regularized Inglish (RI) back in 2005, but never got around to it. Here's a brief explanation.

Axel Wijk's Regularized Inglish is a massive multi-decade job (completed in the 1950s, so there's nothing available online about it) of analyzing practically every word in the language, figuring out what the complicated rules behind the spelling system really are, identifying all the truly irregular words, and proposing properly rule-governed spellings for them. English, for example, is truly irregular only in its first vowel, and so it becomes Inglish.

The underlying principle of RI is that every spelling shall correspond to at most a few sounds, preferably only one; multiple spellings for a single sound, however, are tolerated. Thus -ough is kept for bough, but not for rough, through, plough, hiccough, hough, or borough. Why choose bough? In order to consistently apply the RI rule that says "gh has no effect on the pronunciation of any word".

In my opinion, Wijk goes a bit far in a few places: for example, he introduces dh for the sound of th in father in other than initial positions (the, not dhe) for very little gain; he sorts out long "a" into a as in fat and "aa" as in father; and he changes s to z when pronounced that way, except in the plural of nouns (not nounz) and the third-person singular ending of verbs. I wouldn't bother with any of these changes, which have little impact on being able to pronounce words at sight.

But as reformed (not revolutionized) spellings go, RI is a Great Thing.


A haiku

Quantum mechanics:
    Which moves through the springtime pond?
        Frog or a ripple?

This is post 200.

TagSoup 1.0.4 released!

Just a bug-fix release. See tagsoup.info.


The heart of Celebes Kalossi

I'm finally ready to explain the central ideas of my object-oriented programming model Celebes Kalossi (read the linked post first for the terminology). I've been hanging fire on it for a long time over a terminological issue, which I've decided to just punt on.

In CK, there are two kinds of relationships between classes: subtyping and incorporation. Subtyping is like the relationship between a Java interface and its superinterface; incorporation is something like C++ private inheritance. These two concepts are intertwingled in various ways in various programming languages, but CK completely separates them. Every class subtypes one or more classes except for the root of the class hierarchy (if there is one); classes can incorporate zero or more other classes.

When you declare that class A subtypes class B, you mean that A (the subclass) has all the methods that are declared or defined public in B (the superclass) and all of B's superclasses, plus those declared or defined public in A itself. You are not saying that any of the definitions provided in B or its superclasses are or are not available in A, so there is no problem if a given method is public in more than one superclass. You are also suggesting that an instance of A is Liskov substitutable for an instance of B, although it is impossible to check this property mechanically. We call the public methods of A and its superclasses the interface of A.

When you declare that class C incorporates class D, on the other hand, you mean that the non-private methods defined in class D (the incorporated class) are effectively given the same definition in class C (the incorporating class). Provided that the methods in D have been properly declared in class C, they can be invoked on objects of class C just as if they had been defined as standard or public methods of class C. It does not matter if a method is defined in class D and declared in class C (which is C++ private inheritance) or vice versa. However, the fact that a method is public in class D does not make it public in class C unless it is declared public in class C or one of its superclasses: incorporation affects behavior but not interface.
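As a rough analogy (my own hypothetical illustration, not CK syntax), the split between subtyping and incorporation can be mimicked in Java by using an interface for the subtype relationship and delegation for incorporation: the delegate supplies behavior, but its public methods do not leak into the incorporating class's interface unless they are declared there too.

```java
// Subtyping: only the interface is shared, never any implementation.
interface Stack {
    void push(Object o);
    Object pop();
}

// A class whose definitions can be "incorporated" (shared behavior).
class ArrayStackImpl {
    private final java.util.ArrayList<Object> items = new java.util.ArrayList<>();
    void doPush(Object o) { items.add(o); }
    Object doPop() { return items.remove(items.size() - 1); }
    int size() { return items.size(); } // usable by incorporators, not exported
}

// BoundedStack subtypes Stack (interface) and "incorporates" ArrayStackImpl
// (behavior) via delegation. size() is available internally, but it is not
// public on BoundedStack: incorporation affects behavior, not interface.
class BoundedStack implements Stack {
    private final ArrayStackImpl impl = new ArrayStackImpl();
    public void push(Object o) {
        if (impl.size() >= 100) throw new IllegalStateException("full");
        impl.doPush(o);
    }
    public Object pop() { return impl.doPop(); }
}
```

The analogy is imperfect — in CK the incorporated definitions become the incorporating class's own, with no delegate object — but it shows how the two relationships vary independently.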

For convenience, we say that a class subtypes itself and incorporates itself. Both incorporation and subtyping are transitive: class A subtypes class B's superclasses as well as class B itself, and classes incorporated by class D are implicitly incorporated by class C as well. All the methods in the incorporated classes of C are placed on an equal footing: it does not matter how they were incorporated. A loop in the subtype hierarchy specifies that all the types in the loop have the same interface; a loop in the incorporation hierarchy is ignored, so even if C incorporates D, D can also incorporate C. Note that an implementation may provide types which are outside the model: typical examples would be numeric types, strings, and exception objects.

If a method is declared in a class X or in any of the classes incorporated (directly or indirectly) by class X, it must also be defined exactly once in one of those classes. (If there are no definitions of some method, the class is abstract and must be declared as such.) If incorporating a class would cause conflicts, CK provides for renaming or hiding unwanted methods: class L can incorporate class M including specified methods (in which case all others are hidden), excluding specified methods, or renaming specified methods to new names. Classes that incorporate L see the changed view.

There are no inheritance rules in CK, because there is no inheritance as such: if you want X to use the same implementation of method m as its superclass Y, then have X incorporate the class from which Y gets its implementation of m. You can do the same with classes that are unrelated to you in the superclass hierarchy, or even with your subclasses if you really want to -- the model is nothing if not flexible.

I'll have another post, hopefully soon, about how CK might be implemented on the JVM or CLR.


TagSoup 1.0.3 released!

For TagSoup users among my readers, there's a new release at http://tagsoup.info, providing control of the output encoding and fixing a few bugs.

If you aren't a user but think you might like to be, TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Update: TagSoup 1.0.2 had some brown-paper-bag bugs, so I've released 1.0.3 as a replacement.


The old farts have it

Age and treachery, it is said, beat youth and X every time. But what is X?

Googling suggests that this part of the proverb isn't very stable. I find: enthusiasm, skill, idealism, innocence, inexperience and experience, endurance, exuberance, ability, virility, skills, reflexes, speed, talent, baggy pants, and a bad haircut.