2011-01-25

MicroXML Editor's Draft

I have assembled a very preliminary Editor's Draft for MicroXML. It is about 20 pages long, and covers for MicroXML what is covered by XML 1.0 (Fifth Edition), Namespaces in XML 1.0 (Third Edition), XML Infoset (Second Edition), xml:id 1.0, and XML Base (Second Edition), which total about 82 pages.

It still needs formal normative references, and there are a few open issues listed in section 10. Comments and corrections are earnestly solicited either in comments here or in email.

2011-01-23

The Hoose at Pooh's Neuk / Kidnappit

Well, I've finished the second Pooh book in Scots, and it seems to me just a hair less good than the original. I have no idea if that's the translator, the author, or just me. Definitely it's harder to understand, perhaps more Scots and less English, though not so much you'd notice it right off. I do, however, want to present one of Pooh's hums in three versions: Robertson's translation, a literal back-translation by me, and Milne's original. It comes from Chapter VIII, "Where-intil Wee Grumphie does a Gey Graund Thing":

I lay on ma chist
   I lay on my chest
      I lay on my chest
And thocht I wid jist
   And thought I would just
      And I thought it best
Pit on I was haein a sleep I had missed;
   Put on I was having a sleep I had missed;
      To pretend I was having an evening rest;
I lay on ma wame
   I lay on my belly
      I lay on my tum
Some verse tae declaim
   Some verse to declaim
      And I tried to hum
But naethin particular seemed to strike hame.
   But nothing particular seemed to strike home.
      But nothing particular seemed to come.
Ma face was flat
   My face was flat
      My face was flat
On the flair, and that
   On the floor, and that
      On the floor, and that
Is aw guid and weel for an acrobat;
   Is all good and well for an acrobat;
      Is all very well for an acrobat;
But it doesna seem fair
   But it does not seem fair
      But it doesn't seem fair
Tae a Freendly Bear
   To a Friendly Bear
      To a Friendly Bear
Tae streek him oot unner an auld creel-chair.
   To stretch him out under an old wicker-chair.
      To stiffen him out with a basket-chair.
And a kind o squoot
   And a kind of squoot
      And a sort of sqoze
He could dae wioot
   He could do without
      Which grows and grows
Is no that braw for his puir auld snoot;
   Is not that pleasant for his poor old snoot;
      Is not too nice for his poor old nose,
And a kind o squeed
   And a kind of squeed
      And a sort of squch
Is sair indeed
   Is grievous indeed
      Is much too much
On his mooth and his lugs and the back o his heid.
   On his mouth and his ears and the back of his head.
      For his neck and his mouth and his ears and such.

Plainly some of the changes are enforced by the rhyme, but I do think "I lay on ma wame / Some verse tae declaim / But naethin particular seemed to strike hame" is better, and better poetry, than "I lay on my tum / And I tried to hum / But nothing particular seemed to come."

I'll also mention here Kidnappit, a graphic novel based on Robert Louis Stevenson's classic romance Kidnapped, which is available both in English and in Scots. In the original book, much of the dialogue is Scots-and-water, whereas all the narration, though in first person by a Scotsman, is in standard English or nearly so. I don't know how the characters speak in the English comic, but some lines of dialogue can be usefully contrasted between the two versions that I have read: when our hero David Balfour confronts the Red Fox in Stevenson's original (chapter 17), he says: "I am neither of his people [the Stewarts] nor yours [the Campbells], but an honest subject of King George, owing no man and fearing no man." But in the Scots comic, what David says to the Reid Tod is "I am nae aucht of James's folk or o yours. I am a leal subject o King George, aucht a nae man and feart o nane." Somehow I think that's more likely to be what the real David Balfour (so to speak) actually said.

2011-01-21

The MicroLark parser

I've been developing a parser for MicroXML which I have dubbed MicroLark, in honor of Tim Bray's original 1998 XML parser Lark. I didn't take any code from Lark, but we ended up converging on similar ideas: it provides both push and tree parsers (as well as a pull parser), it is written in Java, and I intend to evolve it as MicroXML evolves. However, just as MicroXML is much smaller than XML, so MicroLark is about a third the size of Lark.

If you want to cut to the chase now, you can get the jar file, the source code of version 0.8, the Javadoc, and the test cases (which are based on the W3C's XML Conformance Test Suites). If you run the jar file, pass it one argument which is the file you want to parse, and you'll get the parsed output in Pyx format, a simple line-oriented format similar to SGML ESIS format. If you specify @ as the file, MicroLark will read the names of files from the standard input, and write its output to files with the same names but with .pyx added.

MicroLark 0.8 supports MicroXML according to James's most recent definition, with the addition of prefixed attribute names. MicroLark allows, but does not require, namespace definitions for these prefixes. Element names do not allow prefixes. This version provides the parser proper, the Element class, and a MicroXML writer. I'll be adding more MicroXML-specific test cases in a later release. Future versions will supply a package of iterators to allow XPath-style operations, and a validator based on MicroRNG.

The core of MicroLark is a pull parser implemented as a state machine, which is just a big switch of switches. The outer switch is controlled by the current state, and the inner switches by the current character. The parser returns when a start-tag, end-tag, character data, end of document, or error is found, making sure that parsing can be resumed smoothly afterwards. Consequently, it is not draconian, though its error recovery strategy mostly consists of resetting the current state to "character data". When the parser returns, the caller can call accessors to get the current element stack or character data, or the location and text of an error. The push parser (SAX-style) and the tree parser (DOM-style) are thin layers over the pull parser; the tree parser is draconian.
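To make the "switch of switches" shape concrete, here is a toy sketch in Python rather than Java. It is not MicroLark's code: the class name ToyPullParser, the state names, and the event tuples are all invented for illustration, and it handles only attribute-free tags and character data. What it does show is the outer dispatch on state, the inner dispatch on character, returning to the caller after each event, and recovery by dropping back to the character-data state.

```python
# Toy pull parser. NOT MicroLark's code: every name here is hypothetical.
TEXT, TAG_OPEN, START_NAME, END_NAME = range(4)

class ToyPullParser:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.state = TEXT
        self.stack = []   # names of the currently open elements
        self.buf = []     # characters collected for the current token

    def _take(self):
        token = "".join(self.buf)
        self.buf = []
        return token

    def next_event(self):
        """Return the next (kind, value) pair; the next call resumes parsing."""
        while self.pos < len(self.source):
            ch = self.source[self.pos]
            self.pos += 1
            if self.state == TEXT:                # outer switch: current state
                if ch == "<":                     # inner switch: current character
                    self.state = TAG_OPEN
                    if self.buf:
                        return ("text", self._take())
                else:
                    self.buf.append(ch)
            elif self.state == TAG_OPEN:
                if ch == "/":
                    self.state = END_NAME
                elif ch.isalpha():
                    self.buf.append(ch)
                    self.state = START_NAME
                else:
                    self.state = TEXT             # recover: back to character data
                    return ("error", "unexpected %r after '<'" % ch)
            elif self.state == START_NAME:
                if ch == ">":
                    self.state = TEXT
                    name = self._take()
                    self.stack.append(name)
                    return ("start", name)
                elif ch.isalnum():
                    self.buf.append(ch)
                else:
                    self.buf = []
                    self.state = TEXT             # recover: back to character data
                    return ("error", "bad character %r in start-tag" % ch)
            elif self.state == END_NAME:
                if ch == ">":
                    self.state = TEXT
                    name = self._take()
                    if self.stack and self.stack[-1] == name:
                        self.stack.pop()
                        return ("end", name)
                    return ("error", "unmatched end-tag %r" % name)
                else:
                    self.buf.append(ch)
        if self.buf and self.state == TEXT:
            return ("text", self._take())
        return ("eof", None)
```

A caller just loops on next_event() until it sees "eof", consulting the stack attribute between calls if it wants the current element stack. A push or tree parser could be layered on top by turning each event into a callback or a node. The toy silently ignores a tag left unfinished at end of input, which a real parser would report.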

The Element class is the incarnation of the MicroXML data model, so it provides access to the name, attributes, and children of an element. Element objects are provided by all three parsers, though only the tree parser populates the children. There are the usual collection methods for fetching, searching, and mutating the attributes and children; text children are represented as Strings, and when an attempt is made to insert a text child next to an existing text child, the two are coalesced. You can create your own Element objects, and the class ensures that they always represent well-formed MicroXML (for example, the names of elements and attributes must be well-formed). There are convenience methods for retrieving the current value of an inherited attribute, and for obtaining the current values of xml:lang/lang, xml:id/id, xml:base, and the namespace of a prefixed attribute.
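As a rough illustration of two of those guarantees (names are checked, and adjacent text children are coalesced), here is a Python sketch. It is not the MicroLark API: the method names are made up, and the name patterns are a deliberately simplified ASCII-only stand-in for the real MicroXML name rules.

```python
import re

# Simplified, ASCII-only stand-ins for the real name rules (an assumption).
_ELEMENT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")
# Attribute names may carry one prefix, as in xml:lang.
_ATTRIBUTE_NAME = re.compile(r"(?:[A-Za-z_][A-Za-z0-9_.-]*:)?[A-Za-z_][A-Za-z0-9_.-]*\Z")

class Element:
    """Hypothetical sketch of an element node; not MicroLark's Element class."""

    def __init__(self, name, attributes=None):
        if not _ELEMENT_NAME.match(name):
            raise ValueError("not a well-formed element name: %r" % name)
        self.name = name
        self.attributes = {}
        self.children = []   # Element objects and plain str values
        for key, value in (attributes or {}).items():
            self.set_attribute(key, value)

    def set_attribute(self, key, value):
        if not _ATTRIBUTE_NAME.match(key):
            raise ValueError("not a well-formed attribute name: %r" % key)
        self.attributes[key] = value

    def append(self, child):
        if isinstance(child, str):
            if child == "":
                return                            # nothing to add
            if self.children and isinstance(self.children[-1], str):
                self.children[-1] += child        # coalesce adjacent text
                return
        elif not isinstance(child, Element):
            raise TypeError("children must be Element or str")
        self.children.append(child)
```

Appending "Lang " and then "syne" to the same element leaves a single text child "Lang syne", so a tree built this way never contains two text children in a row.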

Users can create subclasses of Element and instruct the parser to use them by creating a factory object. Factory objects get the current element stack from the parser as well as the name of the new element, and return an instance based on them. This allows tree nodes to have their own fields and methods suitable for their use in the application, as well as the creation of tree nodes that enforce restrictions on their children such as "no text children" or "element children must belong to class X".

MicroLark is open source, licensed under the Apache 2.0 license. Check it out, play with it, and report bugs and suggestions for improvement in the comments here or to cowan@ccil.org.

2011-01-18

Winnie-the-Pooh in Scots

I have just read, and with great enjoyment, Winnie-the-Pooh in Scots, translated by James Robertson and with the original E. H. Shepard illustrations.

Lang lang syne, a lang while syne noo, aboot Friday past, Winnie-the-Pooh steyed in a forest aw by himsel unner the name o Sanders.

('Whit does "unner the name" mean?' spiers Christopher Robin.

'It means he had the name in gowd letters ower the door and he steyed unner it.'

'Winnie-the-Pooh wisna jist shair,' says Christopher Robin.

'I am noo,' says a gurly [growly] voice.

'Then I'll cairry on,' says I.)

The other characters, I should mention, are Wee Grumphie (Piglet), Heehaw (Eeyore, and a much more sensible spelling than "Eeyore" to an American like me), Rabbit, Hoolet (Owl, cf. English howlet, a mixture of howl and owlet, or their French originals), Kanga, and the Bairn Roo. Teeger doesn't come in until The Hoose at Pooh's Neuk, which has just arrived but I haven't read yet.

The book is a delight overall, especially to one as steeped in Pooh as I have always been, and who had Scott, RLS, and Ian Maclaren mixed in with his English reading from a very early age. I suppose there were half-a-dozen or so words that I didn't know the meanings of, as I unfortunately read the book without the marvelous Dictionar o the Scots Leid, the OED for Scots, at hand. I must particularly praise the verse translations, such as this one from Chapter 1, "Where-intil we are introduced tae Winnie-the-Pooh and a wheen Bees, and the Stories stert":

It's a thocht, is it no, that if Bears were Bees,
They'd bigg their bykes at the bottom o trees.
And if that wis the wey o it (if Bees were Bears)
We widna hae tae speel up aw thir stairs.

My favorite moment, however, was at this line in Chapter 4: "These notices had been written by Christopher Robin, who was the only one in the forest who could spell; for Owl, wise though he was in many ways, able to read and write and spell his own name WOL, yet somehow went all to pieces over delicate words like MEASLES and BUTTEREDTOAST." In Robertson's translation, this becomes: "Thae notices had been written oot by Christopher Robin, the ainly yin in the Forest that could spell; for, lang-heidit though Hoolet wis in mony weys, able tae read and write and spell his ain name HOOTEL, for some reason he gaed aw through-ither when it cam to kittle words like MEASLES and BUTTERYBREID."

The next time I'm writing some software that needs a fanciful name, perhaps I'll call it Hootel.

2011-01-01

MicroXML and JSON

Warning: You don't know about MicroXML without you have read a blog post by the name of "More on MicroXML"; but that ain't no matter, because you can click on the link and read all about it.

Warning Too: Carefully note the word "and" in the title. There's a reason why it's not "versus".

The whole point of MicroXML is to provide an XML spec (and associated data model) which is small and simple enough, and easy enough to implement, that it can go where no XML has gone before. Of course, JSON is already filling part of that niche, and it's even simpler than MicroXML. So MicroXMLers have two choices: think up reasons why JSON is bad, or figure out ways to coexist with it. My personality being what it is, I choose the second.

The goals of this posting are a) to specify a way to losslessly and uniquely transform JSON documents into MicroXML documents and back, and b) to specify a way to add markup to an arbitrary MicroXML document to explain how to transform it to JSON, which probably involves some amount of loss, because if MicroXML weren't more expressive than JSON, it wouldn't have a reason to exist. Consequently, a non-goal is to specify a way to losslessly and uniquely transform MicroXML to JSON and back.

JSON values have six possible types: objects (key-value mappings), arrays (ordered lists of values), strings, numbers, booleans, and null. The simplest approach to the first goal that could possibly work is to define a MicroXML vocabulary with six elements in it, named object, array, string, number, boolean, and null, and that's what I'm going to specify. So JSON converted to MicroXML looks pretty much like JSON itself, only more verbose. Why do this at all? So that the converted JSON can be fed into a MicroXML-based or XML-based pipeline and possibly converted back to JSON at the other end. Of course, if you don't need to do that, no problem: just don't convert to MicroXML in the first place.

Five of the six types are easy to represent: an array element represents the elements of the array using its child elements; a string, number, or boolean element contains the string, number or boolean value as character content, and a null element is always empty.

Next we must choose how to represent the key-value pairs within an object. They can't be represented as attributes (that is, with the key as the attribute name and the value as the attribute value), because the JSON RFC only says that keys SHOULD be unique, not that they MUST be unique, and attribute names in XML elements MUST be unique. So we'll represent each key-value pair as a child element, and represent the value of the pair using the content of the element.

But what about the key? There are two plausible choices: use an element with the fixed name pair and specify the key (which must be a string) using a key attribute, or use the name of the element directly as the key. The first solution is general but verbose; the second solution is not general, because only a subset of strings can appear as a MicroXML (or XML) element name. We'll require MicroXML-to-JSON converters to accept both (be liberal in what you accept), but require JSON-to-MicroXML converters to use the second solution unless the key contains a character that's not valid in XML names (be conservative in what you send). So pair becomes a seventh name in the MicroXML vocabulary for JSON.

(The characters U+FFFE and U+FFFF can appear literally in a JSON string, key, or value, but can't appear in XML character content, not even using character references. These aren't likely to actually occur in JSON documents, but just for completeness we'll say that they must be escaped with JSON escaping as \uFFFE and \uFFFF. This constitutes a minor violation of the rule of verbatim round-tripping, since JSON->MicroXML->JSON will always produce escape sequences for these characters even if the original document had them appear literally, but no realistic JSON application will notice the difference.)
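The JSON-to-MicroXML direction can be sketched in a few lines of Python using only the standard library. Several things here are my assumptions rather than part of the rules above: the value of a pair is written as a single child element of the pair's element; a key that happens to be the string "pair" is sent in the verbose form to avoid confusion; the name test is a simplified ASCII-only stand-in for the real rules; and the U+FFFE/U+FFFF escaping is left out. The helper classes Pairs and Num exist only so that duplicate keys and the original spelling of numbers survive parsing.

```python
import json
import re
import xml.etree.ElementTree as ET

class Pairs(list):
    """The key-value pairs of one JSON object, in order, duplicates kept."""

class Num(str):
    """A JSON number kept as its original text."""

# Simplified, ASCII-only stand-in for the real name rules (an assumption).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")

def to_element(value):
    """Convert one parsed JSON value into an element of the vocabulary."""
    if isinstance(value, Pairs):
        elem = ET.Element("object")
        for key, item in value:
            if NAME.match(key) and key != "pair":
                child = ET.SubElement(elem, key)                 # key as element name
            else:
                child = ET.SubElement(elem, "pair", {"key": key})  # general form
            child.append(to_element(item))
        return elem
    if isinstance(value, list):
        elem = ET.Element("array")
        for item in value:
            elem.append(to_element(item))
        return elem
    if value is None:
        return ET.Element("null")
    if isinstance(value, bool):
        elem = ET.Element("boolean")
        elem.text = "true" if value else "false"
        return elem
    # Num is a subclass of str, so it must be tested before plain strings.
    elem = ET.Element("number" if isinstance(value, Num) else "string")
    elem.text = str(value)
    return elem

def json_to_microxml(text):
    value = json.loads(text, object_pairs_hook=Pairs,
                       parse_int=Num, parse_float=Num)
    return ET.tostring(to_element(value), encoding="unicode")
```

For example, json_to_microxml('{"a": [1, null], "b c": 2.50}') produces an object element whose first child is named a and whose second child is a pair element with key="b c", since "b c" is not a valid name.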

So much for the first goal. What about the second? We'll require JSON->MicroXML translators to adopt the rules above to begin with. What about elements and attributes present in the MicroXML that have other names? We'll say that if an element has the attribute json-type, then the value of that attribute tells us how to process it. Thus an element named list with a json-type attribute of array will be converted to a JSON array. In this process, the actual name of the element and any inappropriate content is discarded, including any character content in an element with a json-type of object or array and any content at all of an element with a json-type of null. We don't discard child elements in elements with a json-type of string, number, or boolean: instead we use the XPath value of the element, which is the same as the content of the element with any tags ignored.

What about MicroXML attributes? We discard them for all elements except those with a json-type of object, where we treat them as additional key-value pairs (excepting of course any json-key attribute).
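Here is a hedged Python sketch of the MicroXML-to-JSON direction, again with only the standard library. The rules above leave some details open, so the following are my assumptions and nothing more: a child of an object that carries its own json-type attribute is itself the value, keyed by its json-key attribute if present and otherwise by its element name; any other child is a pair wrapper whose single child element is the value; and attributes of every object element, not only those marked with json-type, become string-valued pairs. The content of number and boolean elements is copied through without validation.

```python
import json
import xml.etree.ElementTree as ET

TYPES = {"object", "array", "string", "number", "boolean", "null"}

def json_type(elem):
    """The json-type attribute wins; otherwise fall back to the element name."""
    kind = elem.get("json-type")
    if kind is None and elem.tag in TYPES:
        kind = elem.tag
    return kind

def member(child):
    """Key and JSON text for one child of an object (see assumptions above)."""
    if child.get("json-type") is not None:
        return child.get("json-key", child.tag), to_json(child)
    key = child.get("key", child.tag) if child.tag == "pair" else child.tag
    if len(child) != 1:
        raise ValueError("pair %r needs exactly one child element" % key)
    return key, to_json(child[0])

def to_json(elem):
    """Return JSON text for an element, discarding whatever does not fit."""
    kind = json_type(elem)
    if kind == "null":
        return "null"                             # all content is discarded
    if kind in ("string", "number", "boolean"):
        text = "".join(elem.itertext())           # content with tags ignored
        return json.dumps(text) if kind == "string" else text.strip()
    if kind == "array":                           # character content is discarded
        return "[" + ",".join(to_json(child) for child in elem) + "]"
    if kind == "object":
        members = [(name, json.dumps(value))
                   for name, value in elem.attrib.items()
                   if name not in ("json-type", "json-key")]
        members.extend(member(child) for child in elem)
        return "{" + ",".join(json.dumps(k) + ":" + v for k, v in members) + "}"
    raise ValueError("no JSON type for element %r" % elem.tag)
```

Building the JSON text directly, rather than going through a Python dict, keeps duplicate keys and the original spelling of numbers intact.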

As usual, comments are solicited.

2010-12-23

MicroRNG

This is a contribution to the MicroXML conversation. It's a stripped-down version of RELAX NG suitable for validating MicroXML documents. It excludes namespaces, since MicroXML doesn't have them either. Somewhat reluctantly, I have also jettisoned all simple types but a few and all value types except the default, since I figure that MicroXML will mostly be used by applications that need to validate string values in more complicated ways anyhow.

Generalized interleave has a high implementation cost, so I've removed it as well, except for mixed content, which I consider essential. Finally, I've ditched lists, datatype libraries other than stripped-down XSD, foreign markup, name classes, nested grammars, external file inclusion, the notAllowed pattern, divs (which are just for documentation), and definition combining methods.

Here's what's left, in the form of a compact RELAX NG grammar. When translated to XML format, this is also a MicroRNG grammar (modulo namespace issues).

start = elementElem | grammarElem

grammarElem = element grammar {startElem, defineElem*}

startElem = element start {elementElem | refElem | element choice {(startElem | refElem)+ } }

defineElem = element define {attribute name {text}, pattern+}

pattern = elementElem | textElem | mixedElem | attributeElem | valueElem | groupElem | choiceElem | optionalElem | zeroOrMoreElem | oneOrMoreElem | refElem | dataElem

elementElem = element element {attribute name {text}, (emptyElem | pattern+)}

emptyElem = element empty {empty}

textElem = element text {empty}

mixedElem = element mixed {pattern+}

attributeElem = element attribute {attribute name {text}, (valueElem|textElem)?}

valueElem = element value {text}

groupElem = element group {pattern+}

choiceElem = element choice {pattern+}

optionalElem = element optional {pattern+}

oneOrMoreElem = element oneOrMore {pattern+}

zeroOrMoreElem = element zeroOrMore {pattern+}

refElem = element ref {attribute name {text}}

dataElem = element data {"string" | "decimal" | "double"| "integer" | "date" | "dateTime" | "boolean" | "base64Binary"}

In addition, MicroRNG allows only a single element element for any given element name (that is, no more than one definition of an element), even though that reduces the convenience of RNG definitions to that of their DTD equivalents. There are other possible simplifications, like getting rid of element elements as the root, or removing zeroOrMore elements in favor of optional elements wrapped around oneOrMore elements, but I judge them to be more annoying to schema authors than helpful to implementers.

Comments are gratefully solicited either here or at James Clark's blog.

2010-08-02

Somewhere A Duck Quacked

Here's a totally unauthorized reprint of Peter De Vries's story "The Irony Of It All", a satire of pulp writing, and Randall Garrett's "Look Out! Duck!", a science-fiction satire of De Vries's story, now with 50% more jokes as a result of being able to read the De Vries story first. Eventually I'll gussy this up with mutual hyperlinks connecting the two Dumbrowskis and the two Constanzas, but not today.

2010-06-14

Employed again, finally

I'm going to start work at LexisNexis on June 28 with the title of "Content Architect". I'm not sure how much I'm allowed to say about the job, but I will be dealing with UML (about which I know little as yet) and RELAX NG (about which I know plenty). This will be my first official non-programmer job ever, but I will still be writing one-off code for various purposes, so I'm not handing in my geek license just yet.

2010-04-21

The rumors that I've left Google are true

My last working day there was April 13, and I'm actively looking for work now.  If you know of anyone in the NYC area who wants a high-powered generalist who can work with both people and computers, please let me know.

Thanks, and thanks to everyone who's been sending me good wishes and job postings.

2009-12-18

Noun-noun Compounds

Here's a list of different types of noun-noun compounds. The original research was done by Ivan Derzhanski and then adapted by me for use in my Lojban reference grammar.  Unfortunately, the formatting of the on-line version is unreadable, and it's full of Lojban technical terms.  I've salvaged the content here.  All of these languages put the modifier first and the modified term (the "head") second.

After the English gloss of each compound, there's a list of non-English languages that use it.  If the compound is not used in English, there is a definition as well.  The abbreviations are explained below.  If you don't care about the Lojban, you can ignore it.

1. The head represents an action, and the modifier then represents the object of that action.

pinsi kilbra = pencil sharpener (Hun)
zgike nunctu = music instruction (Hun)
mirli nunkalte = deer hunting (Hun)
finpe nunkalte = fish hunting (Tur, Kor, Udm, Aba 'fishing')
smacu terkavbu = mousetrap (Tur, Kor, Hun, Udm, Aba)
zdani turni = house ruler (Kar 'host')
zerle'a nunte'a = thief fear (Skt 'fear of thieves')
cevni zekri = god crime (Skt 'offense against the gods')

2. The head represents a set, and the modifier the type of the elements contained in that set.

zdani lijgri = house row
selci lamgri = cell block
karda mulgri = card pack (Swe)
rokci derxi = stone heap (Swe)
tadni girzu = student group (Hun)
remna girzu = human-being group (Qab 'group of people')
cpumi'i lijgri = tractor column (Qab)
cevni jenmi = god army (Skt)
cevni prenu = god folk (Skt)

3. Conversely: the head is an element, and the modifier represents a set in which that element is contained. Implicitly, the meaning of the head is restricted from its usual general meaning to the specific meaning appropriate for elements in the given set. Note the opposition between "zdani linji" in the previous group, and "linji zdani" in this one, which shows why this kind of compound is called "asymmetrical".

carvi dirgo = raindrop (Tur, Kor, Hun, Udm, Aba)
linji zdani = row house

4. The modifier specifies an object and the head a component or detail of that object; the compound as a whole refers to the detail, specifying that it is a detail of that whole and not some other.

junla dadysli = clock pendulum (Hun)
purdi vorme = garden door (Qab)
purdi bitmu = garden wall (Que)
moklu skapi = mouth skin (Imb 'lips')
nazbi kevna = nose hole (Imb 'nostril')
karce xislu = automobile wheel (Chi)
jipci pimlu = chicken feather (Chi)
vinji rebla = airplane tail (Chi)

5. Conversely: the modifier specifies a characteristic or important detail of the object described by the head; objects described by the compound as a whole are differentiated from other similar objects by this detail.

pixra cukta = picture book
kerfa silka = hair silk (Kar 'velvet')
plise tapla = apple cake (Tur)
dadysli junla = pendulum clock (Hun)

6. The head specifies a general class of object (a genus), and the modifier specifies a sub-class of that class (a species).

ckunu tricu = pine tree (Hun, Tur, Hop)

7. The head specifies an object of possession, and the modifier may specify the possessor (the possession may be intrinsic or otherwise). In English, these compounds have an explicit possessive element in them: "lion's mane", "child's foot", "noble's cow".

cinfo kerfa = lion mane (Kor, Tur, Hun, Udm, Qab)
verba jamfu = child foot (Swe)
nixli tuple = girl leg (Swe)
cinfo jamfu = lion foot (Que)
danlu skapi = animal skin (Ewe)
ralju zdani = chief house (Ewe)
jmive munje = living world (Skt)
nobli bakni = noble cow (Skt)
nolraitru ralju = king chief (Skt 'emperor')

8. The head specifies a habitat, and the modifier specifies the inhabitant.

lanzu tumla = family land

9. The head specifies a causative agent, and the modifier specifies the effect of that cause.

kalselvi'i gapci = tear gas (Hun)
terbi'a jurme = disease germ (Tur)
fenki litki = crazy liquid (Hop 'whisky')
pinca litki = urine liquid (Hop 'beer')

10. Conversely: the head specifies an effect, and the modifier specifies its cause.

djacu barna = water mark (Chi)

11. The head specifies an instrument, and the modifier specifies the purpose of that instrument.

taxfu dadgreku = garment rack (Chi)
tergu'i ti'otci = lamp shade (Chi)
xirma zdani = horse house (Chi 'stall')
nuzba tanbo = news board (Chi 'bulletin board')

12. More vaguely: the head specifies an instrument, and the modifier specifies the object of the purpose for which that instrument is used.

cpina rokci = pepper stone (Que 'stone for grinding pepper')
jamfu djacu = foot water (Skt 'water for washing the feet')
grana mudri = post wood (Skt 'wood for making a post')
moklu djacu = mouth water (Hun 'water for washing the mouth')
lanme gerku = sheep dog (dog for working sheep)

13. The head specifies a product from some source, and the modifier specifies the source of the product.

moklu djacu = mouth water (Aba, Qab 'saliva')
ractu mapku = rabbit hat (Rus)
jipci sovda = chicken egg (Chi)
sikcurnu silka = silkworm silk (Chi)
mlatu kalci = cat feces (Chi)
bifce lakse = bee wax (Chi 'beeswax')
cribe rectu = bear meat (Tur, Kor, Hun, Udm, Aba)
solxrula grasu = sunflower oil (Tur, Kor, Hun, Udm, Aba)
bifce jisra = bee juice (Hop 'honey')
tatru litki = breast liquid (Hop 'milk')
kanla djacu = eye water (Kor 'tear')

14. Conversely: the head specifies the source of a product, and the modifier specifies the product.

silna jinto = salt well (Chi)
kolme terkakpa = coal mine (Chi)
ctile jinto = oil well (Chi)

15. The head specifies an object, and the modifier specifies the material from which the object is made. This case is especially interesting, because the referent of the head may normally be made from just one kind of material, which is then overridden in the compound.

rokci cinfo = stone lion
snime nanmu = snow man (Hun)
kliti cipni = clay bird
blaci kanla = glass eye (Hun)
blaci kanla = glass eye (Que 'spectacles')
solji sicni = gold coin (Tur)
solji junla = gold watch (Tur, Kor, Hun)
solji djine = gold ring (Udm, Aba, Que)
rokci zdani = stone house (Imb)
mudri zdani = wood house (Ewe 'wooden house')
rokci bitmu = stone wall (Ewe)
solji carce = gold chariot (Skt)
mudri xarci = wood weapon (Skt 'wooden weapon')
cmaro'i dargu = pebble road (Chi)
sudysrasu cutci = straw shoe (Chi)

16. The head specifies a typical object used to measure a quantity and the modifier specifies something measured. The compound as a whole refers to a given quantity of the thing being measured. English does not have compounds of this form, as a rule.

tumla spisa = land piece (Tur 'piece of land')
tcati kabri = tea cup (Kor, Aba 'cup of tea')
nanba spisa = bread piece (Kor 'piece of bread')
bukpu spisa = cloth piece (Udm, Aba 'piece of cloth')
djacu calkyguzme = water calabash (Ewe 'calabash of water')

17. The head specifies an object with certain implicit properties, and the modifier overrides one of those implicit properties.

kensa bloti = spaceship
bakni verba = cattle child (Ewe 'calf')

18. The modifier specifies a whole, and the head specifies a part which normally is associated with a different whole. The compound then refers to a part of the modifier which stands in the same relationship to the whole modifier as the head stands to its typical whole.

kosta degji = coat finger (Hun 'coat sleeve')
denci genja = tooth root (Imb)
tricu stedu = tree head (Imb 'treetop')

19. The head specifies the producer of a certain product, and the modifier specifies the product. In this way, the compound as a whole distinguishes its referents from other referents of the head which do not produce the product.

silka curnu = silkworm (Tur, Hun, Aba)

20. The head specifies an object, and the modifier specifies another object which has a characteristic property. The compound as a whole refers to those referents of the head which possess the property.

sonci manti = soldier ant
ninmu bakni = woman cattle (Imb 'cow')
mamta degji = mother finger (Imb 'thumb')
cifnu degji = baby finger (Imb 'pinky')
pacraistu zdani = hell house (Skt)
fagri dapma = fire curse (Skt 'curse destructive as fire')

21. As a particular case (when the property is that of resemblance): the modifier specifies an object which the referent of the compound resembles.

grutrceraso jbama = cherry bomb
solji kerfa = gold hair (Hun 'golden hair')
kanla djacu = eye water (Kar 'spring')
bakni rokci = bull stone (Mon 'boulder')

22. The modifier specifies a place, and the head an object characteristically located in or at that place.

ckana boxfo = bed sheet (Chi)
mrostu mojysu'a = tomb monument (Chi 'tombstone')
jubme tergusni = table lamp (Chi)
foldi smacu = field mouse (Chi)
briju ci'ajbu = office desk (Chi)
rirxe xirma = river horse (Chi 'hippopotamus')
xamsi gerku = sea dog (Chi 'seal')
cagyce'u zdani = village house (Skt)

23. Specifically: the head is a place where the modifier is sold or made available to the public.

cidja barja = food bar (Chi 'restaurant')
cukta barja = book bar (Chi 'library')

24. The modifier specifies the locus of application of the head.

kanla velmikce = eye medicine (Chi)
jgalu grasu = nail oil (Chi 'nail polish')
denci pesxu = tooth paste (Chi)

25. The head specifies an implement used in the activity denoted by the modifier.

me.la.pinpan. bolci = Ping-Pong ball (Chi)

26. The head specifies a protective device against the undesirable features of the referent of the modifier.

carvi mapku = rain cap (Chi)
carvi taxfu = rain garment (Chi 'raincoat')
vindu firgai = poison mask (Chi 'gas mask')

27. The head specifies a container characteristically used to hold the referent of the modifier.

cukta vasru = book vessel (Chi 'satchel')
vanju kabri = wine cup (Chi)
spatrkoka lanka = coca basket (Que)
djacu calkyzme = water calabash (Ewe)
rismi dakli = rice bag (Ewe, Chi)
tcati kabri = tea cup (Chi)
ladru botpi = milk bottle (Chi)
rismi patxu = rice pot (Chi)
festi lante = trash can (Chi)
bifce zdani = bee house (Kor 'beehive')
cladakyxa'i zdani = sword house (Kor 'sheath')
manti zdani = ant nest (Gua 'anthill')

28. The modifier specifies the characteristic time of the event specified by the head.

vensa djedi = spring day (Chi)
crisa citsi = summer season (Chi)
cerni bumru = morning fog (Chi)
critu lunra = autumn moon (Chi)
dunra nicte = winter night (Chi)
nicte ckule = night school (Chi)

29. The modifier specifies a source of energy for the referent of the head.

dikca tergusni = electric lamp (Chi)
ratni nejni = atom energy (Chi)
brife molki = windmill (Tur, Kor, Hun, Udm, Aba)

There are some compounds which don't fall into any of the above categories.

ladru denci = milk tooth (Tur, Hun, Udm, Qab)
kanla denci = eye tooth

It is clear that "tooth" is being specified, and that "milk" and "eye" act as modifiers. However, the relationship between "ladru" and "denci" is something like "tooth which one has when one is drinking milk from one's mother", a relationship certainly present nowhere except in this particular concept. As for "kanla denci", the relationship is not only not present on the surface, it is hardly possible to formulate it at all.

Here are some types of compounds where there is no effective difference between the modifier and the head.  In some languages, it is common for these compounds to occur in the opposite order as well.

30. The compound may refer to things which are correctly specified by both components. Some of these instances may also be seen as asymmetrical compounds where the modifier specifies a material.

cipnrstrigi pacru'i = owl demon (Skt)
nolraitru prije = royal sage (Skt)
remna nakni = human-being male (Qab 'man')
remna fetsi = human-being female (Qab 'woman')
sonci tolvri = soldier coward (Que)
panzi nanmu = offspring man (Ewe 'son')
panzi ninmu = offspring woman (Ewe 'daughter')
solji sicni = gold coin (Tur)
solji junla = gold watch (Tur, Kor, Hun)
solji djine = gold ring (Udm, Aba, Que)
rokci zdani = stone house (Imb)
mudri zdani = wooden house (Ewe)
rokci bitmu = stone wall (Ewe)
solji carce = gold chariot (Skt)
mudri xarci = wooden weapon (Skt)
zdani tcadu = home town (Chi)

31. The compound may refer to all things which are specified by either of the compound components.  English does not have compounds of this form, as a rule.

nunji'a nunterji'a = victory defeat (Skt 'victory or defeat')
donri nicte = day night (Skt 'day and night')
lunra tarci = moon stars (Skt 'moon and stars')
patfu mamta = father mother (Imb, Kaz, Chi 'parents')
tuple birka = leg arm (Kaz 'extremity')
nuncti nunpinxe = eating drinking (Udm 'cuisine')
bersa tixnu = son daughter (Chi 'children')

32. Alternatively, the compound may refer to things which are specified by either of the compound components or by some more inclusive class of things which the components typify.

curnu jalra = worm beetle (Mon 'insect')
jalra curnu = beetle worm (Mon 'insect')
kabri palta = cup plate (Kaz 'crockery')
jipci gunse = hen goose (Qab 'housefowl')
xrula tricu = flower tree (Chi 'vegetation')

33. The compound components specify crucial or typical parts of the referent of the compound as a whole.  English does not have compounds of this form, as a rule.

tumla vacri = land air (Fin 'world')
moklu stedu = mouth head (Aba 'face')
sudysrasu cunmi = hay millet (Qab 'agriculture')
gugde ciste = state system (Mon 'politics')
prenu so'imei = people multitude (Mon 'masses')
djacu dertu = water earth (Chi 'climate')

Here are the explanations of the three-letter language-name abbreviations:

Aba = Abazin
Chi = Chinese
Eng = English
Ewe = Ewe
Fin = Finnish
Geo = Georgian
Gua = Guarani
Hop = Hopi
Hun = Hungarian
Imb = Imbabura Quechua
Kar = Karaitic Hebrew
Kaz = Kazakh
Kor = Korean
Mon = Mongolian
Qab = Qabardian
Que = Quechua
Rus = Russian
Skt = Sanskrit
Swe = Swedish
Tur = Turkish
Udm = Udmurt

2009-11-06

More of my blather

If, Ghu help you, you want to see a lot of the stuff I'm posting as blog comments rather than saving for my own blog, then this amazing search engine is your friend.  If you, too, post a lot of comments, go to their home page and set up your own profile, giving them the "web address" you use to post, and everyone can see many of the places you are posting to as well.  Very, very nice.

Update: Alas, this service is dead.

2009-10-24

More female programmers

I tried to post this comment to a public site, but failed repeatedly. The topic of the original post isn't relevant to my comment, which was in response to a comment that read, in its entirety:

Why would we would want more female programmers?

My answer:

The world needs more effectively mobilized brains. We can't afford to constrain ourselves on what size or shape or color the bodies are that house those brains. Also, diversity is good in itself: it improves flexible response, and it's silly to throw away a cheap source of diversity.

A major U.S. university with a strong CS program (I am contractually prevented from naming it) that had female CS undergraduate admissions in the single digits year after year was able to raise their admission to the same rate as other engineering programs by changing just one thing: they no longer gave people who already had programming experience preferential admission. There have been no changes in the overall performance of the student body in the years since.

2009-10-22

"Omnilingual"

This is to announce my edited version of H. Beam Piper's classic story of linguistic archaeology on Mars, "Omnilingual". Why edit a classic? Here's my Editor's Introduction:

H. Beam Piper's 1956 story "Omnilingual" is one of the few, and still one of the best, science fiction stories in which the science is linguistic archaeology. While the meat of the story holds up marvelously fifty years later, the particulars are firmly rooted in the 1950s. Everyone smokes like a chimney — on Mars! The women are called girls, and their gender is mentioned at every conceivable opportunity. All the work is still done with pencil and paper and sketching boards and looseleaf notebooks.

My edits, then, are intended to modernize the work, to help the 2009 reader not stumble over the details. Notebooks are computerized; sketchbooks have been replaced by tablets. Gender equality and the metric system are taken for granted. Smoking isn't even mentioned. I wedged in a mention of the Classic Maya decipherment of the 1980s (a counterexample to the story's thesis!), but let one of the characters dismiss it as irrelevant. I set the story, as Piper did, forty years in the future, but that is now 2049 rather than 1996. There are fewer This Is Science Fiction flags, so "Earth" instead of "Terra", "U.N." instead of "Federation Government".

Piper's Mars and his Martians are completely impossible based on what we know of Mars today. Rather than trying to change all that, which would have involved wholesale destruction and re-invention, I have changed the planet's name to Ares after the Greek rather than the Roman god of war. The intention is to suggest someplace analogous to Mars as we know it in 2009, but different in detail. The atmosphere on Ares is thin, but breathable with supplementary oxygen; the humidity, while low, supports plenty of life forms. As for the too-human Martians (or Areans), I have made them an offshoot of Homo sapiens whose presence on the fourth planet from the sun remains a mystery.

However, the characters, the plot, the underlying logic remain the same. Hopefully I haven't damaged the story too much in trying to adjust it to modern taste. Those who prefer the original form can easily find it at Project Gutenberg, who provided the public-domain base text from which this revision was made. They also have the original Frank Kelly Freas drawings, which I didn't feel right about using -- they were made in the 1950s, too, and no longer seemed to fit the revised text.

Read and enjoy!

2009-10-16

Fragments

David Moser's relentlessly self-referential story "This Is the Title of This Story, Which Is Also Found Several Times in the Story Itself" begins simply enough with the fairly ordinary sentence "This is the first sentence of this story."

But by the fourth paragraph, a harbinger of what is to come: "Introduces, in this paragraph, the device of sentence fragments. A sentence fragment. Another. Good device. Will be used more later."

True enough. "Incest. The unspeakable taboo. The universal prohibition. Incest. And notice the sentence fragments? Good literary device. Will be used more later."

A later passage from the same increasingly disconnected tale: "Bizarre. A sentence fragment. Another fragment. Twelve years old. This is a sentence that. Fragmented. And strangling his mother. Sorry, sorry. Bizarre. This. More fragments. This is it. Fragments. The title of this story, which. Blond. Sorry, sorry. Fragment after fragment. Harder. This is a sentence that. Fragments. Damn good device."

Still further down: "The purpose. Of this paragraph. Is to apologize. For its gratuitous use. Of. Sentence fragments. Sorry. "

And then: "Or this sentence fragment? Or three words? Two words? One?"

Getting near the end: "By the throat. Harder. Harder, harder."

Lastly: "This is."

Read. The whole thing. Worthwhile. NSFW, technically.

2009-10-01

Why Are PHBs Stupid?

Mark Liberman on Language Log asks:
However we decide to define "manager", this group is certainly now the object of a complex of negative stereotypes. When and how did this start? I don't know, and I welcome suggestions. These attitudes may be connected to the antique European aristocratic disdain for those who are "in trade", and to the (I think related) modern intellectual disdain for the world of business. These attitudes seem to have been imported from the intelligentsia into industry through the medium of engineers and especially programmers, who (at least at lower levels) maintain a very different culture from the "suits" in finance, marketing, product planning, and so on.
I think Mark's right to speak of "engineers and especially programmers", and I think the key phrase is "maintain a very different culture". Historically, the boss that most people dealt with was the foreman, which the OED defines in the relevant sense as "the principal workman; specifically, one who has charge of a department of work." You began by doing the work, and if you got good at it, you ended up telling other people with less experience or less competence how to do it instead. This could go right up to the top: Thomas Edison began as an inventor, and wound up running a huge "invention factory", the first modern industrial research lab.

Two factors undermined this, though: the sense that promoting high-quality workmen instead of continuing to take advantage of their work made no sense, and the idea that management was or could be a profession abstracted from the particular work being managed. The first factor appeared particularly strongly in computer programming because of the huge disparity in productivity: the best programmers are literally two orders of magnitude more productive than the average. Losing a top steelworker to foremanship might cost the company the labor of 2-3 standard steelworkers, but losing the productivity of 100 merely competent programmers seemed insane. And of course geeks tend to like their jobs, and to be uninterested in (and incompetent at) people-managing. Companies had to deal with the widespread appearance of workers who did not want to be promoted, ever.

At the same time, the rise of the MBA spread the meme among the suits that managing people was a learned profession like law or medicine or engineering, where you primarily apply what you have learned from books, courses, etc. to the requirements of the job. Before that, management had always been seen as a job, like digging ditches or being President of the United States: you can prepare for it to some extent, but mostly you do a job by applying whatever you have to whatever you need to do.

Making management a profession was arguable; the associated notion that you could manage workers with no understanding of what they did was a disaster. Computer programmers were in the forefront of knowing what had happened: they quickly saw that their bosses had no idea of how the work was done, the necessary conditions for doing it, or the difference between what could be done, what could be done with extraordinary effort, and what could not be done at all. The boss had always been seen as a mean fellow (after all, he tells you what to do and can fire you), but now he also appeared clueless and even stupid, someone who could not be made to understand no matter what.
None of the early citations in the OED, nor the quotes that I find in LION, seem to reflect the modern Dilbertian managerial stereotype. That stereotype clearly predates Dilbert — but when did it arise? and where did it come from?

In this context, we have to return to Andrew's question: What is a manager, anyhow? By now, I suppose that the Dilbert empire employs a certain number of people, whom Scott Adams in some sense manages — does he thereby consider himself a "manager" in the relevant sense?
Scott Adams is not only a manager now, he has always been one by training: he was an economics major, not any kind of scientist or engineer, and he got an MBA before he worked with his first geek. He is extraordinarily observant (especially for an MBA, I add snarkily) and he actually does grasp how geeks think, but despite appearances he basically sees them from the outside. When I discovered this, the shock was so great that I started to see him as an outsider mocking my culture rather than an insider mocking its excesses (though to be sure Dilbert is harder on suits than on nerds), and I lost interest in the strip completely.

(Note: Even though Mark says he's been a manager since 1980, I think that industrial research and academia still basically run on the old model, and therefore their managers, including him, are mostly exempt from the trend I am reporting here.)

2009-09-21

Common Lisp symbols bound in more than one namespace


These are the Common Lisp symbols which are bound in more than one namespace:  for example, + is both a function (addition) and a variable (the most recent form evaluated by the REPL).  The links point into the Common Lisp Hyperspec.


2009-05-24

Two Kinds

There's two kinds of people in the world: those who divide the world into two kinds of people, and those who don't.

There's three kinds of people in the world, those who can count and those who can't.

There's 10 kinds of people in the world, those who can do binary and those who can't.

There's 10 kinds of people in the world, those who understand trinary, those who don't understand trinary, and those who mistake it for binary.

And, of course, there's two kinds of people in the world, those who can tell a joke, and those who can't.

Or perhaps there are really three kinds, those who can tell a joke, those who can't, and those who can but run it into the ground.

But Little Anthony and the Imperials said it best.

2009-05-09

No more anonymous comments; sorry.

I just had to clear out about 100 anonymous spam comments, and Blogger doesn't make that easy. So no more anonymous comments. Sorry. But you can still comment on any posting, no matter how old.

2008-12-26

Recycled Nursery Rhymes and Songs for Secular Babies

Here are a few things I sing to Dorian (who is now six months old) along with more conventional fare like "Guten Abend, gut' Nacht" (the Brahms Lullaby), "You Are My Sunshine", and "Veni, veni Emmanuel":

Air: Three Blind Mice
Dor-i-an, Dor-i-an
See who I am, see who I am,
I am the Drool- and the Burpinator,
I am the Fart- and the Poopinator,
I am the Squeal- and the Howlinator,
I'll be baaaack, I'll be baaaack.

Air: Puttin' on the Ritz
Who's that baby, what is he doin'
He's my grandson, he is a-chewin'
Dor-i-an . . . Chewin' on his bib.

Who's that baby, where is he goin'
I don't know and there is no knowin'
Dor-i-an . . . Chewin' on his bib.

Air: Jesus Loves Me
Grandpa loves me, this I know,
'Cause his caring tells me so,
Little me with him belongs,
Till I'm bold and brave and strong.

Yes, Grandpa loves me (3x)
His caring tells me so.

(This gets changed to Grandma or Mommy or even Grownups on occasion.)

Air: Deck the Halls
Fast away the bottle's draining,
  Do-do-do-do-do, do-do-ri-an.
On the bib the drips are raining,
  Do-do-do-do-do, do-do-ri-an.
Soon the back we will be pounding,
  Do-do-do, do-do-do, Do-ri-an.
And the burps will be resounding,
  Do-do-do-do-do, do-do-ri-an.

Air: Tell Me Why
Tell me why the stars do shine,
Tell me why the ivy twines,
Tell me why the sky's so blue,
Tell me, oh tell me, just why I love you.

Nuclear fusion makes stars to shine,
Tropism makes the ivy twine,
Scattering makes the sky so blue,
Gonadal hormones are why I love you.

(This is the only one I didn't make up myself.)

2008-10-14

Converting Restricted XML to Good-Quality JSON

Here are some ideas for converting restricted forms of XML to good-quality JSON. The restrictions are as follows:
  • The XML can't contain mixed content (elements with both children/attributes and text).
  • The XML cannot depend on the order of child elements with distinct names (order dependence in children with the same name is okay).
  • There can't be any attributes with the same name as child elements.
  • There can't be any elements or attributes that differ only in their namespace names.
You also need to know the following things for each child element:
  • Whether it MUST appear at most once (a singleton element) or MAY appear more than once (a multiplex element).
  • Whether it contains only text (an element with simple type) or child elements and/or attributes (an element with complex type).
Now, to convert the XML to JSON, apply these rules recursively:
  • A singleton element of simple type, and likewise an attribute, is converted to a JSON simple value: a number or boolean if syntactically possible, otherwise a string.
  • A multiplex element of simple type is converted to a JSON array of simple values.
  • A singleton element of complex type is converted to a JSON object that maps the local names of child elements and attributes to their content. Namespace names are discarded.
  • A multiplex element of complex type is mapped to a JSON array of JSON objects that map the local names of child elements and attributes to their content. Namespace names are discarded.
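
The rules above can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine, not a finished converter: the function names (convert, simple_value, local_name) are made up, the knowledge about each element is passed in as two sets of local names (multiplex and complex_type), and "syntactically possible" is taken to mean "matches JSON's own number or boolean syntax". The sample document and its namespace are invented too.

```python
import json
import re
import xml.etree.ElementTree as ET

# JSON number syntax; anything else (e.g. "007") stays a string.
_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def local_name(tag):
    """Drop ElementTree's '{namespace}' prefix, keeping the local name."""
    return tag.rsplit("}", 1)[-1]


def simple_value(text):
    """Return a number or boolean if the text looks like one, else a string."""
    text = (text or "").strip()
    if text in ("true", "false"):
        return text == "true"
    if _NUMBER.fullmatch(text):
        return json.loads(text)
    return text


def convert(elem, multiplex, complex_type):
    """Convert the content of one element to a JSON-compatible value.

    multiplex and complex_type are sets of local names (an assumption of
    this sketch) saying which elements may repeat and which have complex type.
    """
    if local_name(elem.tag) not in complex_type:
        return simple_value(elem.text)
    # Attributes are treated like singleton elements of simple type.
    obj = {local_name(k): simple_value(v) for k, v in elem.attrib.items()}
    for child in elem:
        name = local_name(child.tag)
        value = convert(child, multiplex, complex_type)
        if name in multiplex:
            # Multiplex children are collected into an array, in document order.
            obj.setdefault(name, []).append(value)
        else:
            obj[name] = value
    return obj


doc = """<order xmlns="http://example.com/ns" id="17" rush="true">
  <customer>Ann</customer>
  <item><sku>A1</sku><qty>2</qty></item>
  <item><sku>B2</sku><qty>1</qty></item>
  <tag>gift</tag>
</order>"""

root = ET.fromstring(doc)
result = convert(root, multiplex={"item", "tag"}, complex_type={"order", "item"})
print(json.dumps(result, indent=2))
```

Note that the single tag element still comes out as a one-element array, because it is declared multiplex; that is what keeps the JSON shape stable no matter how many times the element happens to appear.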
Comments are very welcome.