2005-07-25

GPLed software is not Debian-free

Really. The GNU GPL violates the Debian Tentacles of Evil test, because it is a bare license, and (with proper notice) a licensor can revoke a bare license at any time. At most the people already relying on the license may be able to use the legally tricky doctrine of promissory estoppel to go on relying on it. Everyone else is SOL.

2005-07-22

Monochrome and Gentium

As you can see, I've switched my blog to monochrome. I've also switched to using the Gentium font if you have it. If you don't, do yourself a favor and download it (a ZIP archive; for the Mac and Linux RPM formats, see the download page).

Unicode and fonts

This piece is too big for a blog entry, so it's on my web site.

2005-07-21

18th century chitchat at Samuel Johnson's table

SIR A [Alexander Macdonald]: "I think, Sir, almost all great lawyers, such at least as have written upon law, have known only law, and nothing else."

JOHNSON: "Why no, Sir; Judge Hale was a great lawyer, and wrote upon law; and yet he knew a great many other things; and has written upon other things. Selden too."

SIR A: "Very true, Sir; and Lord Bacon. But was not Lord Coke a mere lawyer?"

JOHNSON: "Why, I am afraid he was; but he would have taken it very ill if you had told him so. He would have prosecuted you for scandal."

BOSWELL: "Lord Mansfield is not a mere lawyer."

JOHNSON: "No, Sir. I never was in Lord Mansfield's company; but Lord Mansfield was distinguished at the University. Lord Mansfield, when he first came to town, 'drank champagne with the wits,' as Prior says. He was the friend of Pope."

SIR A: "Barristers, I believe, are not so abusive now as they were formerly. I fancy they had less law long ago, and so were obliged to take to abuse, to fill up the time. Now they have such a number of precedents, they have no occasion for abuse."

JOHNSON: "Nay, Sir, they had more law long ago than they have now. As to precedents, to be sure they will increase in course of time; but the more precedents there are, the less occasion is there for law; that is to say, the less occasion is there for investigating principles."

SIR A: "I have been correcting several Scotch accents in my friend Boswell. I doubt, Sir, if any Scotchman ever attains to a perfect English pronunciation."

JOHNSON: "Why, Sir, few of them do, because they do not persevere after acquiring a certain degree of it. But, Sir, there can be no doubt that they may attain to a perfect English pronunciation, if they will. We find how near they come to it; and certainly, a man who conquers nineteen parts of the Scottish accent, may conquer the twentieth.

"But, Sir, when a man has got the better of nine tenths he grows weary, he relaxes his diligence, he finds he has corrected his accent so far as not to be disagreeable, and he no longer desires his friends to tell him when he is wrong; nor does he choose to be told. Sir, when people watch me narrowly, and I do not watch myself, they will find me out to be of a particular county [Staffordshire]. In the same manner, Dunning may be found out to be a Devonshire man. So most Scotchmen may be found out.

"But, Sir, little aberrations are of no disadvantage. I never catched Mallet in a Scotch accent; and yet Mallet, I suppose, was past five-and-twenty before he came to London."

     --Boswell's Life of Johnson for 1772

Before I hear about it for "Scotch" and "Scotchman", I will point out that Boswell, a Boswell of Auchinleck and most certainly a Scot of the Scots, uses these forms himself.

Wireless hackers ca. 1903

This article illustrates neatly that our predecessors of a century ago were much like us. I was especially struck by the vadding needed to lay down an amateur (wired) telegraph line, the "my transmitter has more kW than yours" battle, and the "I can't remember his name, but his callsign is ...".

Two paternosters in Scots

A modern version:

Faither o' us aa, bidin abune,
thy name be halie.
Let thy reign begin.
Thy will be dune,
on the erthe, as it is in Hevin.
Gie us ilka day oor needfu fendin
an forgie us aa oor ill-deeds,
e'en as we forgie thae wha dae us ill
an lat us no be testit,
but sauf us frae the Ill-Ane,
for the croon is thine ain,
an the micht,
an the glorie,
for iver an iver.

A more traditional version:

Oor faither in heiven
hallowt be thy name.
Thy kingdom come,
Thy will be dune,
on the yird as in heiven.
Gie us oor breid for this incomin day.
Forgie us the wrangs we hae wrocht,
as we hae forgien the wrangs we hae dree'd.
An say us na sairlie.
But sauf us frae the ill-ane.
And thine be the kingdom,
the pooer, an the glorie,
noo an forivver. Amen.

Note in each case the semantics of the Seventh Petition: not "deliver us from evil", but specifically "deliver us from the Evil One."

Old Irish satire

Although Westerners have pretty well given up believing that the name is the thing, or even that there's some kind of resonance between names and things, it hasn't always been so. My ancestors were about as western as you can get in Europe, and they used names to induce malefactors to commit suicide: "sticks and stones may break my bones, but names can really mess me up."

Here are a few examples of Old Irish satire, that most feared of all arts:

A satire on Bres, "the first satire made in Ireland", from the story of the Second Battle of Mag Tuireadh.

Cen cholt for crib cernini,
Without food quickly on a dish,
cen gert ferbba fora n-asa aithrinni,
without cow's milk on which a calf thrives,
cen adba fir iar ndruba disoirchi,
without a man's dwelling after the staying of darkness,
cen dil daimi rissi,
without a storyteller band's payment,
ro sain Brisse.
so be Bres.

A "poem to raise blisters" from Cormac's Glossary:

Maile baire gaire Caier
Evil, death, short life to Caier
combeodutar celtra cath Caier
May battle spears slay Caier
Caier diba Caier dira Caier foro
Caier by land, Caier by earth, Caier rejected
fomara fochara Caier.
Under mound, under rocks, Caier.

And a single line from a much longer satire; each word means "I will satirize" using three different roots:

gromfa gromfa glamfa glamfa aerfa aerfa

I suspect that these curses probably did "work" in their context: an intensely shame-oriented (as opposed to guilt-oriented) culture in which people identified extremely strongly with their public reputations, to the point where the destruction of that reputation could send them into overwhelming depression. (God knows it isn't so today: the Irish have acquired an overwhelmingly guilt-oriented culture somewhere along the way.)

I often wonder if Japan might have been able to escape the spiral of events leading to World War II if they had institutionalized public satire as an escape valve.

Go away, you silly samurai, or I will satirize you ... a second time!

2005-07-17

Problems with quantitative reasoning

I'm not a linkblogger, really I'm not, but this ham tale gave me something I badly needed lately: a really thoroughgoing laugh attack, the kind where you check yourself at the end to see if you need to change your clothes.

Of course, the error in quantitative reasoning was in supposing that a Quarter-Pounder contains 1/4 lb of hamburger after it's cooked: it contains maybe a third of that.

2005-07-14

Regularized Inglish

I know this looks like a whole mess of misspellings, but it's actually a very sensible spelling reform (not revolution) devised by Axel Wijk and published in his book Regularized Inglish back in 1959, and simplified slightly by me. This is the American version.

Wunce upon a time thare livd a poor boy named Dick Whittington, hooze faather and muther were bothe ded. Having neether home nor frends, he roamed about the cuntry trying to ern hiz living. Sumtimes he cood not finde any wurk, and he offen had to go hungry.

On market days he herd the farmers tauk about the greit city of Lundon. They sed that its streets wer paved with gold. So Dick made up hiz minde to go to Lundon and seek hiz fortune. Packing hiz clothes into a bundle and cauling hiz faithful cat he started out. After days and days ov wauking, the hungry lad finally reached Lundon.

But alas, the streets were not paved with gold but hard cobblestones. He wondered [i.e. wandered] about the city seeking for wurk. At last, he came to the house ov a rich merchant and knocked at the dor.

The dor woz opened by the cook, but when she oenly saw a ragged boy on the step, she woz angry and told him to begon. At that moment the oener ov the house, Master Fitzwarren, returned and seeing the poor boy's condition he took pity on him and ordered the cook to giv him sum foode. "If yoo wish to wurk", he added, "yoo may stay here and help cook in the kitchen. Yoo will finde a bed in the attic." Dick thanked Master Fitzwarren very much for hiz greit kindeness.

Dick might hav been happy had it not been for the cook, hoo whipped him aulmoste every day. She treated him so badly that the merchant's daughter, hoo woz a kinde-harted girl, felt very sorry for the lonely lad.

Wun day the merchant cauld aul hiz servants together. He told them that he had a ship reddy to sail to forren lands, and that each ov them might send sumthing in her, and they shood hav aul that it sold for.

"What ar yoo going to send, Dick?" asked the merchant's daughter.

"I hav nuthing to send," sed Dick sadly, "nuthing but my cat."

"Fetch thy cat then, boy, and send her!" sed the merchant.

Dick woz sorry to part with Poossy, yet he obeyed hiz master, and with tears in hiz ies gave Pooss to the capten ov the ship.

Aultho Dick wurked hard and tried to pleaze the cook, she continued to beat and torment him. At last he cood not stand it eny longer and made up hiz minde to run away. Wun morning he got up very erly, and packing hiz few things into a tiny bundle, he slipped out ov the house. When he got az far az a place cauld Highgate, he felt tired and sat down thare to rest. Suddenly the bells ov Boe Church began ringing and az he lissened it seemed to him that they wer saying:

Ding-dong, ding-dong,
Turn agen, Whittington,
Thrice Lord Mayor ov Lundon!

"Lord Mayor ov Lundon", he said to himself. "Hoo wood not be Lord Mayor ov Lundon? But if I run away I'll never hav a chance. I'll go back again and endure aul the cook's beatings raather than miss such a chance." Back he hurried and managed to get into the house before the cook had cum down.

While aul this woz happening, the ship with Dick's cat woz bloen by a storm to a distant cuntry inhabited by Moors. Theze peeple receeved the capten and hiz men kindely and wer anxious to see whot the straingers had in their ship. The capten shoed [i.e. showed] them hiz goods and aulso sent sum samples to the king ov the cuntry.

The King woz so well pleazed with the samples that he invited the capten to hav dinner with the King and Queen. Az soon az the dishes wer braught in and poot down on the table, an immense number ov rats rushed out from every side and sworming over the foode, ate it nearly aul up. The capten woz amazed at this and asked the King how he cood stand such a thing.

"But whot can I doo to stop them?" sed the King. "I wood gladly giv haf my kingdom to get rid ov theze pests."

Then the capten thaught of Dick's cat and told the King that he had a little animal on hiz ship that wood make short wurk ov theze creatures.

"Go bring this wunderful animal to me," cried the King, "and I will load yoor ship with gold and jewels in exchainge for her."

The capten hurried off while anuther dinner woz being prepared, and when he returned with Pooss, the rats wer bizzy eating that aulso. Down amung them he put Pooss and she flew around killing a greit number, while the rest ran away.

The King and Queen wer overjoyed to see their enemies thus dispersed, and when the capten sed that he wood be onnord if they wood allow him to make them a prezent of Pooss, the King woz so delighted that he baught aul the ship's cargo and gave ten times az much for the cat.

The ship then sailed back with fair winds to Ingland. On arriving home the capten went to the merchant and shoed [i.e. showed] him aul the trezures that the King had given for Pooss. The onnest merchant at wunce sent for Dick and congratulated him on having becum a rich man. "Yoor cat haz braught yoo more munney than I pozess," he sed. "May yoo liv long to enjoy it."

Dick fell on hiz knees and thanked Heven for hiz good fortune. He then reworded [i.e. rewarded] the capten and the crew and aulso gave prezents to aul the servants, even to the cross-tempered cook.

Later on Dick married hiz master's daughter and the yung cupple lived long and happily. The prophecy that the bells ov Boe Church had chimed in the ears ov the ragged boy later came true. Three times woz Dick Whittington Lord Mayor ov Lundon.

Question and answer

On a mailing list I was administering, a user enquired:

All quiet: no whines,
no rants, no eye-glazing screeds;
Was I unsubscribed?

I replied:

No, not at all, Sir.
Our view of you is unchanged.
Fix your spam filters.

Irregularities

Most of this material comes from Steven Pinker's book Words and Rules.

There are about 165 inherited irregular verb roots in English (for example, see, saw, seen), and maybe 35 irregular noun roots (for example, foot, feet). This does not count the Latin and Greek plurals, which we typically learn in school and don't acquire with the rest of the language.

In English, the regular nouns and verbs are the most common kind, but this isn't true in some other languages: In closely related German, for example, the overwhelming majority of nouns are irregular (the regular ending -s, although applicable to all sorts of nouns, is quite rare), and there are far more irregular verbs than in English.

Similarly, the noun classifier system in Chinese and other languages operates quite analogously to irregular noun plurals in other languages; there is a regular classifier ge, and then there are lots of fuzzily defined families of nouns, each with its specific classifier. These families tend to be organized on semantic lines, but with lots of exceptions.

For example, the Chinese classifier for human being has a respectful tone, and the word for thief doesn't normally take it, using the regular classifier instead. It's a defect in most Chinese dictionaries that they don't list the most usual classifier for a noun, in the way that French dictionaries show gender.

In Japanese, which has the same kind of classifier system, the classifier for book remains the one for vertical cylinders, despite the prevalence of codices over scrolls for some generations now. In Burmese, where nouns can almost always be used with more than one classifier, a semantic explanation tells us why basket of cows is a forbidden combination, but does not explain why a team of horses is also forbidden: one must refer to the fact that Burmese do not happen to use teams of horses.

It makes a great deal of difference whether a word is regular or irregular: compounds formed from them obey quite different rules. A compound or idiom whose head has an irregular root is irregular: overate, undid, bogeymen, stepchildren, milk teeth, straw men, oil mice, beewolves (a kind of wasp), cut a deal, bought the farm, caught cold, went bananas, threw up. Note that He threw up the ball (not an idiom) and He threw up his lunch (idiomatic) are syntactically indistinguishable; in either case, the up can be postposed.

However, a bahuvrihi compound is regular: tenderfoots, sabertooths, lowlifes, flatfoots, still lifes. Walkmans is also headless, though not technically bahuvrihi (if it were, it would mean "one who walks like a man" or something similar).
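The two rules above (a headed compound inherits its head's irregular plural, a headless one takes the regular ending) can be put as a small Python sketch. The function name, the tiny lookup table, and the "just add -s" regular rule are all simplifying assumptions made for illustration; nothing here comes from Pinker's book beyond the example words.

```python
# Hypothetical mini-lexicon of irregular plurals, limited to heads that
# appear in the examples in this entry.
IRREGULAR_PLURALS = {
    "man": "men",
    "child": "children",
    "tooth": "teeth",
    "foot": "feet",
    "wolf": "wolves",
    "life": "lives",
}


def compound_plural(modifier: str, head: str, headed: bool = True) -> str:
    """Pluralize modifier+head.

    headed=True:  the compound names a kind of its head (a bogeyman is a
                  kind of man), so the head's irregular plural is inherited.
    headed=False: bahuvrihi or otherwise headless (a lowlife is not a kind
                  of life), so the regular ending applies.
    """
    if headed and head in IRREGULAR_PLURALS:
        return modifier + IRREGULAR_PLURALS[head]
    # Simplified regular rule: append -s (ignores -es, -ies and so on).
    return modifier + head + "s"


for args in [("bogey", "man"), ("step", "child"), ("bee", "wolf")]:
    print(compound_plural(*args))
for args in [("saber", "tooth"), ("low", "life"), ("walk", "man")]:
    print(compound_plural(*args, headed=False))
```

Whether a compound is headed cannot be read off its spelling, which is why the sketch takes it as an argument instead of guessing.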

Rootless nouns and verbs made from names, quotations, sounds, abbreviations and foreign words are regular: I've been Rolling Stoned and Beatled till I'm blind, There are five "man"s on that page, The tire made several pffffts.

Denominal verbs where the verb is derived from an irregular noun are nevertheless regular: stringed means 'having had a string removed', despite the verb string, strung underlying the noun string, and to be put out (in baseball) by reason of hitting a fly ball which is caught is to be flied out, not flew out, because of the intervening noun fly 'fly ball'. Likewise, The doctor slided the sample means he put it on a microscope slide.

Regularly inflected nouns can't normally be incorporated into compounds: mice-eater is acceptable, rats-eater is unacceptable. Exceptions arise when the plural form refers to a heterogeneous collection of individuals treated as distinct entities: a dog hater needn't hate any particular dog (he need not even have met a dog), but an enemies list is a list of specific, individual enemies. Likewise with Landmarks Commission, singles bar, the Morphemes Project. Pinker collects these and puts them on his (what else?) exceptions list.

2005-07-08

Loyal opposition

In a post on WS-* last year, Tim Bray used the phrase "Her Majesty's Loyal Opposition", saying:

The idea is, they Oppose the Government but are Loyal in that they promise not to lead a mob with pitchforks to string them up; and they stand ready to provide an alternative.

Historically, the Opposition was the Loyal opposition not because it was against overthrowing the monarchy, but because in George III's day there were the "King's friends" and then there were the others, who most certainly did not want to be thought of as the "King's enemies". The term "Loyal Opposition" served to deflect the antagonism of the Crown to those who did not support its policies, whatever they happened to be. Later, when the Crown became a captive of whoever won the elections, the sense of the term shifted a bit.

But hang in there loyally opposing. Those who take pleasure in common sense and sound design support you.

Our HTML is XML

Tim Bray wrote:

There seem to be two kinds of XML projects: those where they send some emails and examples back and forth and are now in production, and those where they strike a task force to assemble the schemas, and the project is still in committee stage.

I replied that there is a third road, the Fascist-Publisher approach:

All our news articles conform to one of these two DTDs. One is a subset of XHTML with some additional elements, the other is IPTC NewsML. Take it or leave it.

The only problem with this approach is the following too-often-repeated dialogue:

Me: We can provide you with news in HTML.

Hapless Customer: Don't you have XML?

Me: Our HTML news is well-formed XML. [It would be valid, but there is no DOCTYPE declaration, to prevent hammering on the DTD on our web server.]

HC: But we want XML, not HTML!

Me: As I say, our HTML news is well-formed XML. [Noch einmal]

Orgon son of Ubu

The original French:

Ondoyons un poupon, dit Orgon, fils d'Ubu. Bouffons choux, bijoux, poux, puis du mou, du confit ; buvons, non point un grog : un punch. Il but du vin itout, du rhum, du whisky, du coco, puis il dormit sur un roc. L'infini bruit du ru couvrit son son. Nous irons sous un pont où nous pourrons promouvoir un dodo, dodo du poupon du fils d'Orgon fils d'Ubu.

Un condor prit son vol. Un lion riquiqui sortit pour voir un dingo. Un loup fuit. Un opossum court. Où vont-ils? L'ours rompit son cou. Il souffrit. Un lis croût sur un mur : voici qu'il couvrit orillons ou goulots du cruchon ou du pot pur stuc.

Ubu pond son poids d'or.

In English translation:

"I'm going to rock this child in his cot," sighs Orgon, son of Ubu. "I'm going to wolf down mutton, broccoli, dumplings, rich plum pudding. I'm going to drink, not grog, but punch." Orgon drinks hock, too, rum, Scotch, plus two hot brimming mugs of Bovril to finish up with, which soon prompts him to nod off. Running brooks drown out his snoring. I stroll to rocks on which I too will nod off, with Orgon's dozing son, with Orgon, son of Ubu.

Condors swoop down on us. Poor scrofulous lions slink out, scrutinizing dingos with scornful looks. Chipmunks run wild. Opossums run, too, without stopping. North or south? I wouldn't know. Plunging off clifftops, bison splits limb in two. It hurts. Ivy grows on brick, rising up from stucco pots to shroud windows or roofs.

From Ubu's bottom drops his own bulk in gold.

2005-07-07

Hello! I am an XML encoding sniffer

I am an algorithm that sniffs at byte streams that purport to be XML documents to figure out what character set is used to encode them. I start by checking the first four bytes of the stream to assign a tentative encoding. If I see:

  • 0xEF 0xBB 0xBF, I assign "UTF-8";
  • 0xFF 0xFE or 0xFE 0xFF followed by anything but 0x00 0x00, I assign "UTF-16";
  • 0x4C 0x6F 0xA7 0x94, I assign "EBCDIC-unknown";
  • Otherwise, I assign "unknown".

If the tentative encoding is "UTF-8", I return it. Otherwise I then read forward, ignoring all 0x00 bytes, until I find either a 'g' (0x67, or 0x87 on the EBCDIC code path) or a '>' (0x3E, or 0x6E on the EBCDIC code path).

In the former case, I sniff further for an apostrophe (0x27, or 0x7D on the EBCDIC code path) or a double quotation mark (0x22, or 0x7F on the EBCDIC code path). I then collect the encoding name following it until the next apostrophe or quotation mark, always ignoring 0x00 bytes, and return it. (On the EBCDIC code path, I need to translate it from invariant EBCDIC to ASCII first.)

In the latter case, there is no encoding declaration, and I return "UTF-16" if the tentative encoding was "UTF-16" or "UTF-8" otherwise. (On the EBCDIC code path, this is an error.)

Then someone else starts over from the beginning of the byte stream, decoding and parsing. I may return erroneous results if the document is not well-formed XML, but in that case there will certainly be errors detected by the parser.
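For the curious, here is roughly what I look like in Python. Treat it as a sketch under stated assumptions, not a reference implementation: the function name and the ValueError on a truncated stream are inventions for illustration, Python's cp037 codec stands in for "invariant EBCDIC", and '>' is taken to be 0x3E in ASCII and 0x6E in EBCDIC.

```python
# Byte values for each code path: 'g', '>', and the two quote characters.
ASCII = {"g": 0x67, "gt": 0x3E, "quotes": (0x27, 0x22)}
EBCDIC = {"g": 0x87, "gt": 0x6E, "quotes": (0x7D, 0x7F)}


def sniff_encoding(data: bytes) -> str:
    """Guess the character encoding of a byte stream purporting to be XML."""
    head = data[:4]
    if head[:3] == b"\xEF\xBB\xBF":
        return "UTF-8"  # UTF-8 byte order mark: done
    if head[:2] in (b"\xFF\xFE", b"\xFE\xFF") and head[2:4] != b"\x00\x00":
        tentative = "UTF-16"
    elif head == b"\x4C\x6F\xA7\x94":  # "<?xm" in EBCDIC
        tentative = "EBCDIC-unknown"
    else:
        tentative = "unknown"
    ebcdic = tentative == "EBCDIC-unknown"
    codes = EBCDIC if ebcdic else ASCII

    # One iterator shared by all three loops, with 0x00 bytes dropped.
    stream = (b for b in data if b != 0x00)

    # Read forward to the first 'g' (as in "encoding") or '>'.
    for b in stream:
        if b == codes["g"]:
            break
        if b == codes["gt"]:
            # The declaration ended without naming an encoding.
            if ebcdic:
                raise ValueError("EBCDIC stream without an encoding declaration")
            return "UTF-16" if tentative == "UTF-16" else "UTF-8"
    else:
        raise ValueError("stream ended before 'g' or '>' was seen")

    # Skip to the opening apostrophe or quotation mark.
    for b in stream:
        if b in codes["quotes"]:
            break

    # Collect the encoding name up to the closing quote.
    name = bytearray()
    for b in stream:
        if b in codes["quotes"]:
            break
        name.append(b)

    # On the EBCDIC path the name must be translated before returning it.
    return bytes(name).decode("cp037" if ebcdic else "ascii")
```

Calling sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') returns "ISO-8859-1", while the same document without the encoding pseudo-attribute comes back as "UTF-8".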

Extreme 2005

I'll be presenting two tutorials at Extreme Markup 2005: one on Unicode, and one on RESTful Web Services. Please come to Extreme and sign up for these, because if there aren't enough people, they'll be canceled and I won't get there and that would be baaaaaaaad.

Disclaimer

Almost everything I post here is coming from my outgoing mail archive, so if you use Google you can often find the original context. I do make improvements (corrections, clarifications, and so on), and these are the small subset of my postings I consider worth reading outside that context. I'm working through the archive from 1999 forward, so eventually I'll get to the fairly recent stuff.

Hey, I said it's recycled knowledge, but I didn't say how many times it had been recycled. One of the reasons I dropped off a lot of mailing lists at the end of 2004 was that I found I was repeating the same postings I had made years ago.

2005-07-06

Legacy

Anything that is actually implemented and deployed is by definition legacy. Googling for "legacy Java code" (exact phrase) today produced 444 hits, "legacy Java" by itself almost 30,000, and even "legacy XML" 619.

As usual, Frederick Brooks called it correctly thirty years ago in The Mythical Man-Month, chapter 1:

[T]he product over which one has labored so long appears to be obsolete upon (or before) completion. Already colleagues and competitors are in hot pursuit of new and better ideas. Already the displacement of one's thought-child is not only conceived, but scheduled. [...] As soon as one freezes a design, it becomes obsolete in terms of its concepts.

Of course, he also says, with understatement that is less than usual in our profession:

The new and better product is generally not available when one completes his own: it is only talked about [...]. The real tiger is never a match for the paper one, unless actual use is wanted. Then the virtues of reality have a satisfaction all their own.

"th" as in "then"

English th represents two different sounds, the sound of th in thin and the sound of th in then. The two sounds are quite rare in the world's languages: only Greek and Icelandic among the European languages have both. Although they are as different as f and v or s and z, there has never been any way of distinguishing them in English orthography (in the IPA they are written θ and ð respectively). This isn't as bad as it seems, because in fact the second sound is used only in a minority of words, which basically fall into four categories:

  1. in native English words between vowels
  2. in native English words at the end where there used to be a vowel that has been lost, often with a silent -e that represents it
  3. at the beginning in closed-class words
  4. just before a final m with no vowel separating them

Examples of the first group are: bother, brethren (formerly bretheren), brother, either, farther, father, fathom, feather, further, gather, hither, leather, mother, neither, nether, other, rather, slither, smithereens, smithy, smother, swarthy, together, weather, (bell)wether, whether, (where)withal, wither, withershins (a variant of widdershins, a rare word meaning 'counter-clockwise'), worthy, and their inflected and derived forms. Either appears on this list, but ether does not, because it is borrowed from Greek. Thither belongs to both this group and the third group.

Examples of the second group are: bathe, bequeath, betroth, blithe, breathe, clothe, lathe, lithe, loathe, scythe, seethe, smooth, soothe, teethe, tithe, withe, wreathe, writhe and their inflected and derived forms. The verb mouth (not the noun mouth) also belongs to this category.

Examples of the third group are: than, that, the, their(s), them, then, thence(forth), there (and compounds), these, they, this, those, (al)though, thus, thy(self). All these words belong to closed classes, grammatically similar groups of words that aren't easily added to English, such as pronouns, conjunctions, and articles. Open classes like nouns, verbs, and adjectives don't normally have this sound at the beginning of the word: thigh is a noun and has the first th sound, whereas thy is a pronoun and has the second.

Examples of the fourth group (the only ones I can find) are algorithm, logarithm, and rhythm.

There are very few minimal pairs for these two th sounds; that is, pairs of words which are distinguished only because one has the first and the other the second sound. Thigh and thy, mentioned above, form perhaps the best-known pair. For some English-speakers, thin and then are likewise a minimal pair, because they do not distinguish between short e and short i before m or n.

Earlier varieties of English (and perhaps some dialects still) had no phonemic voicing contrasts in fricatives. Since then, all fricative phonemes save h have split (f and v, s and z, θ and ð, and ʃ and ʒ) but to varying degrees, and even now there are not very many minimal pairs for any of the four.

2005-07-04

Borders, lots of borders

Which country is bordered by the greatest number of other countries?

It's a close race. Russia borders on Norway, Finland, Estonia, Latvia, Lithuania (via its Kaliningrad exclave), Poland (via Kaliningrad), Belarus, Ukraine, Georgia, Azerbaycan, Kazakhstan, China, and Mongolia, and is separated by only a few kilometers of water from Japan (in the Kuriles) and the U.S. (the Diomedes in the Bering Strait).

Update: Russia also borders on North Korea, a border so short I missed it before, and (amazingly enough) on Switzerland: there is a monument to the Russian general Alexander Suvorov near Göschenen in central Switzerland that is legally Russian territory.

China borders on North Korea, Russia, Mongolia, Kazakhstan, Kyrgyzstan, Tajikistan, Afghanistan, Pakistan, India, Nepal, Bhutan, Myanmar, Laos, and Vietnam. If you call Taiwan a country, the Taiwanese island of Jīnmén (better known as Quemoy) is also only a few kilometers from China.
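To see just how close the race is, here is a quick tally in Python. It counts only the land neighbors named in this entry (including North Korea from the update); the sea near-misses are left out, and the Suvorov monument is counted separately, since whether it makes Switzerland a neighbor is a matter of taste. The set names are mine.

```python
# Land neighbors exactly as listed in this entry.
RUSSIA = {
    "Norway", "Finland", "Estonia", "Latvia", "Lithuania", "Poland",
    "Belarus", "Ukraine", "Georgia", "Azerbaycan", "Kazakhstan",
    "China", "Mongolia", "North Korea",
}
CHINA = {
    "North Korea", "Russia", "Mongolia", "Kazakhstan", "Kyrgyzstan",
    "Tajikistan", "Afghanistan", "Pakistan", "India", "Nepal",
    "Bhutan", "Myanmar", "Laos", "Vietnam",
}

print("Russia:", len(RUSSIA))
print("China:", len(CHINA))
# Counting the monument near Göschenen as a border with Switzerland:
print("Russia with Switzerland:", len(RUSSIA | {"Switzerland"}))
# Countries that border both.
print("Shared:", sorted(RUSSIA & CHINA))
```

On these lists the two are tied at fourteen apiece, and the monument is the tie-breaker.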