Jonathan Swift on DDoS

Jonathan Swift in the "Drapier's Letters" back in 1724, writing on what is now called distributed denial of service:

It is true, indeed, that, within the memory of man, the parliaments of England have sometimes assumed the power of binding this kingdom [Ireland] by laws enacted there; wherein they were at first openly opposed (as far as truth, reason, and justice are capable of opposing) by the famous Mr. Molyneux, an English gentleman born here, as well as by several of the greatest patriots and best whigs in England; but the love and torrent of power prevailed.

Indeed the arguments on both sides were invincible. For, in reason, all government without the consent of the governed, is the very definition of slavery: but, in fact, eleven men well armed will certainly subdue one single man in his shirt. But I have done; for those who have used power to cramp liberty, have gone so far as to resent even the liberty of complaining: although a man upon the rack was never known to be refused the liberty of roaring as loud as he thought fit.

Historical note: it was on this precedent that the American colonies founded their claim not to be governed by the English Parliament; they gave evidence of their refusal by dumping taxable tea into Boston Harbor.

North American Federation

First, go read Tim Bray's post. What follows will make no sense without it. (I originally sent it as an email to Tim.)

Hey, wonderful! I'm all for it.

It got me to thinking about symbolism for the Federation. How 'bout we use the Maple Leaf Flag, but with the red bars changed to blue? This would keep the "red, white, and blue" symbolism important to Unitedstatesians, but the outline would be that of Canada's flag.

Then we could use America the Beautiful as the national anthem (far better than either official anthem both in lyrics and melody, in my opinion). I think verses 1, 2, 7, and 4 of the final 1913 version, in that order, are the keepers, and the blue bars in the new flag would resonate with "from sea to shining sea".

Politically, the Federation keeps a prime-ministerial system with an elected but ceremonial head of state, like all sensible democratic countries. We'd have 16 states, 10 provinces, 2 commonwealths (Massachusetts and Pennsylvania), 3 territories, and one free city (Washington City). We'd also of course have New York City, the financial capital of the planet Earth.

As for the Republic wanting access to the Pacific, let them buy a border province or two from Mexico. "How many Texans does it take to screw in a lightbulb?" "Texans don't screw in lightbulbs; they go to Mexico."

On a more serious note, this map is interesting. Back in 1981, a Washington Post writer named Joel Garreau (of Quebecker descent) wrote a fascinating book called The Nine Nations of North America, showing how the continent naturally fell into nine distinct regions (with a few outliers like Manhattan, D.C., and Hawai'i). There's a main web site where the whole book is available online; it's back in print, too.


The Heinlein Index

The Heinlein Index is the answer to the question:

For how many minutes must a journeyman carpenter [that is, neither an apprentice nor a master] labor in order to be able to buy one additional kilogram of the local standard bread?

For the United States, 2000 data (the most recent I can easily find) shows the median hourly wage of a journeyman carpenter as USD 17.28, and a loaf of bread (approx. 1 lb = 1/2.2 kg) costing USD 2.50. That leads to a current HI of about 19.

The neat thing about the HI is that it represents the marginal relative value of labor, and thus neatly compensates for not only the varying cost of living, but the varying standard of living. We all have to eat, though some of us live in caves and others in high-rises.
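
The arithmetic behind the figure of about 19 can be sketched as follows. This is only an illustration using the wage, loaf price, and pound-to-kilogram factor quoted above; the function name heinlein_index is made up for this sketch.

```python
# Sketch of the Heinlein Index arithmetic, using the 2000 US figures from the post.
# The function name and constants are illustrative, not from any library.

LB_PER_KG = 2.2  # approximate conversion used in the post


def heinlein_index(hourly_wage, bread_price_per_kg):
    """Minutes a journeyman carpenter must work to buy one kilogram of bread."""
    return bread_price_per_kg / hourly_wage * 60


wage = 17.28          # USD per hour, median journeyman carpenter
loaf_price = 2.50     # USD for a loaf of roughly one pound
price_per_kg = loaf_price * LB_PER_KG

hi = heinlein_index(wage, price_per_kg)
print(round(hi, 1))
```

Plugging in local wage and bread prices for another country gives a directly comparable number, since the currency cancels out.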

The Internet Oracle

The Internet Oracle is, well, it's hard to say what it is: a game, a service, an outlet for the imagination, a virtual personality? But anyhow, you submit a question to it, and you get back an answer, typically a humorous and creative one. There are traditions about how you formulate questions (by groveling, in short) and how the answer is worded (it always ends by telling you what you owe the Oracle). In fact, though, what you owe is an answer to someone else's question which the Oracle will send you one day.

Here's a reply I just got:

The Internet Oracle has pondered your question deeply. Your question was:

O sagaciousest and perspicaciousest of Oracles, it's little I know of the duties of men of the sea, but I'll eat my hand if I understand however the Narrator can be at once the cook, and the Captain bold, and the mate of the Nancy brig, and the bosun tite, and the midshipmite, and the crew of the Captain's gig. Can you please explicate this matter for the future benefit of your unworthy slave?

And in response, thus spake the Oracle:

I was musing over this with some... contacts of mine, who received this piece of intelligence with great interest. Turns out Mr N. Arrator has been claiming paychecks on behalf of at least six different people, none of whom were legally entitled to work in this country. Rest assured Mr Arrator will be spending a long time paying for this.

Oh, and thanks to this tip-off, I negotiated my way out of some rather nasty audit proceedings regarding my tribute, so you owe the Oracle nothing. This time.

Internal evidence shows that the crew of the Captain's gig in fact numbered four.


Sorry to be a pain about this

Sorry to be a pain about this, but I'm enabling Blogger's CAPTCHA feature, so you are going to have to type in the string of letters and numbers you see in a graphic in order to comment. I'm tired of deleting comment spam, especially since Blogger doesn't seem to notify me when people post comments.



One of the differences between Thurrodowism and Christianity is that although both have an Old Testament and a New Testament, in Thurrodowism not only can the New Testament correct and update the Old Testament, but the Old Testament can correct and update the New Testament.

For example, the Younger Canon warns against enterprises that require getting dressed up, whereas the Elder Canon tells us that sometimes getting dressed up is exactly what people want to do. This is pointed out in the notes to Grandmother Little Bear Woman's "collation/ripoff" of the Elder Canon, The Way and The Power of the Way.

And if you ask how this can be true, I merely reply that Laozi was one hell of a clever bastard.


The Song of the End-of-lifed

I posted this to xml-dev on a Friday afternoon back in 2003. No tune is required, but you can sing it to any of various tunes that fit the original if you want to.

We have fed our code with a thousand docs
  And still it cries unfed,
Though there's never a tag of all those tags
  But marks our hackers dead:
We have sent our best through the standards mess,
  Past the Borg and the Borg's own gull,
If blood be the price of SGML,
  By Charles, we ha' paid in full!

There's never a flood on the list we love
  But it trashes the posts we wrote;
There's never an ebb of sensible mail
  But the poets come out, quote unquote,
With their haiku (bad) and their limericks (worse),
  And their double dactyls too.
If blood be the price of XML,
If blood be the price of XML,
  By Jon, we ha' pushed it through!

So we'll parse our well-formed marked-up text,
  For that is our doom and our pride,
As it was when they punched on the 029
  So it is with the Code Worldwide.
Though we disagree upon every point
  To this one fact we can swear:
If blood be the price of XSLT,
If blood be the price of XSLT,
If blood be the price of XSLT,
  By James, we ha' paid it fair!
      --Not Rudyard Kipling


Funny things

Well, that last post was #150, though I didn't notice that while posting it.

It's a funny thing, isn't it, that the Latin-based preface is English of several centuries' standing, whereas foreword, apparently quite Saxon, is in fact a calque of German Vorwort and was still being condemned by purists in the latter part of the 20th century?

Another funny thing, and a sort of footnote to my posting on doublets: the Greek root authent- came into English twice: once via Latin and French, giving us authentic, authentication and so on; and a second time via Turkish, giving us effendi 'a man of property, authority, or education in an eastern Mediterranean country'.


Trigrams are sequences of three letters in running text, counting space as a letter. I've had this table around for a while, but never publicized it on my web site index. Now I have.

The table is derived from the Brown Corpus, a million words of American English prose, and contains every trigram that appears at least once in every 10,000 words in that corpus (not once in every 10,000 times, as the previous version had it).
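
For anyone who wants to build a similar table, here is a minimal sketch of the counting. It is not the script used for the published table: lowercasing the text and collapsing runs of whitespace into one space are assumptions, and the function names are invented for the example.

```python
# Sketch: count letter trigrams in running text, treating a space as a letter.
# Lowercasing and whitespace collapsing are assumptions, not the original method.
import re
from collections import Counter


def trigram_counts(text):
    """Return a Counter of every three-character sequence in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def frequent_trigrams(text, per_words=10_000):
    """Keep trigrams occurring at least once per `per_words` words of text."""
    word_count = len(text.split())
    threshold = word_count / per_words
    return {t: c for t, c in trigram_counts(text).items() if c >= threshold}


counts = trigram_counts("the the")
print(counts.most_common(1))
```

For a million-word corpus the threshold works out to 100 occurrences.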


The Big Eight

Languages with more than a hundred million speakers, that is; not accounting firms, which in any case are now reduced to the Big Four thanks to mergers and malpractice.

Update: Blast it, Japanese has 122 million speakers. Why does Ethnologue claim there are only 8 languages in this range? Screws up the whole post. It makes me wonder if there are more hiding in the database but not readily accessible from the Web pages. Anyhow, thanks to JibberJim for the heads-up here.

Chinese. 873 million native speakers. The Big One, really; it's almost three times the size of Spanish and English, the next largest languages. It's also the most widely taught second language in the world, though most of the people who learn it as a second language live in China, and almost half of the second-language speakers use another Sinitic language as their mother tongue.

Spanish. 322 million native speakers. The widespread one: it's an official or national language in 21 countries plus New Mexico, a state of the U.S. It's truly remarkable how well this old colonial empire has held together linguistically despite vast political and geographical differences; though it's possible to tell where a Spanish-speaker comes from, there are no serious barriers to mutual intelligibility (though there are many jokes about it).

English. 309 million native speakers. English now belongs to the world: about 200 million people speak it as a second language with varying degrees of proficiency, making it the most widely taught foreign language. If we counted English as a Second Language as a separate language, it would make this list the Big Nine.

Bengali. 211 million native speakers, half of them in the small but very densely populated country of Bangladesh (which indeed is named after the language). Most people who speak Bengali, interestingly, are descended from second-language speakers, which gives it an unusual degree of consistency for such a large language.

Arabic. 206 million native speakers. This is the diverse one: "Arabic" is actually a cover term for over 30 closely related (but not always mutually intelligible) languages, unified by Modern Standard Arabic, which is an updated version of Classical Arabic, the language of the Quran. Nobody actually speaks it, except for politicians making speeches, and even then the speeches have to be translated into the local Arabic language.

Hindi. 181 million native speakers. If you count the closely related Urdu (different script, different religion, different set of twenty-dollar words), tack on another 60 million native speakers. There are a lot of other languages in India, some of them quite large, and the term "Hindi" covers a lot of linguistic territory: it has been said that the local language in India changes significantly every 100 kilometers.

Portuguese. 177 million native speakers. Bizarrely, the Ethnologue calls this "a language of Portugal", even though only a tiny fraction of Portuguese speakers live there. The overwhelming majority, of course, live in Brazil, the other large non-Spanish-speaking American country.

Russian. 145 million native speakers. Here the much more recent colonial empire didn't hold together so well: except in Russia itself and in Kyrgyzstan, the language is not official anywhere, though there are plenty of Russian-speakers in what Russians call "the near foreign".

So how are we doing if we know all those languages (and hardly anyone does, I'll bet)? Well, we can now handle the native languages of just 40% of the world's population. That's how linguistically diverse the planet is. If you throw in the next 75 languages, with more than 10 million but less than 100 million speakers, you do much better, reaching 79%; add to that the 264 languages with more than a million but less than 10 million speakers, and coverage goes up to 93%.

After that, it's the Long Tail, with 6,565 languages with less than a million speakers, covering the remaining 7%. To be sure, sometimes native languages aren't everything: in the Pacific island nation of Vanuatu, with about 200,000 people, no single language has more than about 7000 native speakers, but almost everyone can handle at least some Bislama, an English-based creole (which itself has only 5000 native speakers).

Essentially all the data here comes from the 15th edition of the Ethnologue.
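
As a sanity check on the 40% figure, the eight native-speaker counts above can simply be added up. The sketch below uses only the numbers quoted in this post (Japanese, from the update, is left out); the world population it prints is inferred from the 40% figure, not taken from the Ethnologue.

```python
# Sketch: add up the Big Eight native-speaker figures (in millions) quoted above.
# The implied world population is back-calculated from the post's 40% figure.

speakers_millions = {
    "Chinese": 873,
    "Spanish": 322,
    "English": 309,
    "Bengali": 211,
    "Arabic": 206,
    "Hindi": 181,
    "Portuguese": 177,
    "Russian": 145,
}

total = sum(speakers_millions.values())
implied_world_millions = total / 0.40

print(total)
print(round(implied_world_millions))
```

The total comes to about 2.4 billion native speakers, which at 40% implies a world population of roughly six billion.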


The seven dirty words

George Carlin referred to the Seven Words You Can Never Say On Television as "Anglo-Saxon" (i.e. Old English). Piss, however, is not Old English at all; it's a straight borrowing from French. The agent ending -er in cocksucker and motherfucker is also a borrowing from French -ier, though some words ending in -arius, the Latin ancestor of -ier, entered Old English and now show -er as well.

Fuck and cunt are certainly not French, but don't look like Old English either, for phonological reasons; they are probably borrowings from other Germanic languages. There may be a remote connection to Latin cunnus, as in Latin and English cunnilingus.

The remaining two words (shit and tits) as well as the roots cock, suck, and mother, would be quite as clear to Anglo-Saxons (allowing for changes in pronunciation) as to us.

A few funny sayings

I can call spirits from the vasty deep.
Why, so can I, or so can any man;
But will they come when you do call for them?

--Shakespeare, Henry IV, part 1, Act III, Scene 1

"Of course, a certain number of scientists have to go mad, just to keep the tradition alive."

--Matt Ruff, Fool on the Hill

"His movements could be called cat-like, except that he did not stop to spray urine up against things."

--Terry Pratchett, Night Watch

"From the way you attack your consonants as if they were an enemy swordsman and swallow your vowels as if they were a light snack, I would judge that you were raised in the East. Is that not so?"

--Sethra Lavode to Morrolan
--Steven Brust, Lord of Castle Black

From Cbits to Qbits

Here's a really excellent article (available in several formats) by the physicist N. David Mermin of Cornell. It's intended to teach computerniks just enough mathematical quantum mechanics to understand quantum computing. I now feel like I grok it myself. The "Where is ħ‽" section near the end is particularly clever.

Here's to the Duke, God bless her!

The Channel Islands sit in the English Channel off the coast of France, but they are not part of France. Their history and politics are most amazing things, and I can hardly do better than point you all to the Wikipedia article. So go there, read at least the first two sections, and then come back here for one final anecdote (I don't vouch for its accuracy, which is why I haven't merged it into Wikipedia).

When at the height of the Napoleonic Wars the U.K. attempted to provide Guernsey with a Smuggling Act, the States of Guernsey protested that such an act was against the constitution of Guernsey, and therefore could be of no effect there. After all, it was the Channel Islands — as part of Normandy — who had conquered England, not the other way about.

In the end, a compromise was reached whereby all instances of smuggling in the Channel Islands would be tried solely in local courts, whereas any other instance of smuggling outside the U.K. would be tried in the county of Middlesex, as usual. (At the time, piracy and smuggling were, formally speaking, committed "on the high seas in the county of Middlesex".) As you can well imagine, there were no convictions, and the trade routes between England and France remained open.


Tutorials at XML 2005

I'll be presenting two tutorials at XML 2005 in Atlanta.

On Friday, November 18, I'll be re-presenting the half-day tutorial "RESTful Web Services: building them without WSDL, SOAP, or tears" (OpenOffice|Powerpoint|PDF) that I gave at Extreme Markup Languages 2005.

On Monday, November 14, I'll be giving a full-day tutorial on XML schema languages. The first half will cover DTDs, RELAX NG, and Schematron; the second half will be devoted to W3C XML Schema. The first half shares a lot of material with the tutorial "RELAX NG: DTDs On Warp Drive" (OpenOffice|Powerpoint|PDF) that I've given in the past.

Show up, or else!


The black guard and the house wives

Nowadays the word blackguard is a somewhat archaic insult, but the black guard were originally the kitchen servants, who were so called because they had to deal with coal. As often happens, words for lower-class people become words for bad people: villain used to mean 'serf' (this meaning is usually spelled villein these days, but the words are the same), and before that it meant 'village-dweller'.

The narrator of Chaucer's Canterbury Tales says of the Knight:

He nevere yet no vileynye ne sayde
In al his lyf unto no maner wight.

which (in addition to being a spectacular example of Middle English multiple negation) means that the Knight 1) was never rude and 2) never behaved like a peasant.

The standard pronunciation of "blackguard" is "blaggard". It's typical for compounds to be pronounced as single words when they get established, and later to undergo sound-change as if they were single words. For example, English has created three separate compounds from house and wife: the modern formation housewife, the Middle English hussif (obsolete now, but still current in the 19th century) meaning 'sewing-kit', and the one dating back to Old English times, which now takes the form hussy.


Yet another filk

Not many people have heard of the song Lillibulero today, though many have heard the tune on the BBC World News Service without knowing what it was, or that it's generally attributed to Henry Purcell. I was reading an article about the near-total loss of snow on Mount Kilimanjaro in Tanzania, made famous by the Ernest Hemingway story "The Snows of Kilimanjaro". Shortly thereafter, I found the chorus of this song dancing in my head. The classic BBC version is on YouTube; it has about the right tempo, but it's only one verse + chorus. You'll need to put it on Loop (right click on the video before starting it) if you want to sing along. The original lyrics, history, other versions, and much more are at Wikipedia.

Ah, brother mine, have you read the report?
Kilimanjaro's melting away.
The summers grow long, the winters grow short:
Kilimanjaro's melting away.


'Jaro, 'Jaro, Kilimanjaro,
All of your snows are melting away;
'Jaro, 'Jaro, 'Jaro, 'Jaro,
Kilimanjaro, melting away.

The rain it will fall, the storm it will storm,
Kilimanjaro's melting away.
Our planet is growing unpleasantly warm,
Kilimanjaro's melting away.


There is a prophecy found in old books,
Kilimanjaro's melting away,
The world will be ruled by dimwits and crooks,
Kilimanjaro's melting away.


And if we don't change our culture of waste,
Kilimanjaro won't be alone.
Antarctican ice will slip off its base,
Katrina will look like a weekend at home.


London, London, New York and London,
Bangkok and Singapore under the waves.
Global warming, global warming ‒
Hundreds of millions in watery graves.

'Jaro, 'Jaro, Kilimanjaro,
All of your snows are melting away;
'Jaro, 'Jaro, 'Jaro, 'Jaro,
Kilimanjaro, melting away.


Dido. As in Queen of Carthage.

Long and long and long ago, my children, before the Internet became a haven for porn, there was a free email-based erotica server. If you sent an appropriately formatted email to louvre at dido dot fa dot indiana dot edu, you'd get back a story that had been posted to the Usenet group rec.arts.erotica. And all was well... until the postmaster at indiana.edu contacted the archive maintainer, and pointed out to him the large number of undeliverable emails that were coming in addressed to louvre at dildo dot fa dot indiana dot edu!

I don't think these addresses work any more, but I've obfuscated them anyhow. Why give the indiana.edu sysadmins even more spam to deal with? By the way, we're talking Indiana University here, not to be confused with Indiana University of Pennsylvania, so called because it's in a town with the unlikely name of Indiana, Pennsylvania (Jimmy Stewart was born there).

Blood groups and true parents

You can't respond anonymously any more. Just make up an identity. Sorry, but the spam got out of control.

Blood type calculator: enter blood types for both parents and find out possible blood types for a child, or blood types for one parent and one child and find out possible blood types for the other parent. Please use this first!

I got a letter two years ago from someone who very much wanted help disentangling her family history. She wrote to me:

I have just found my deceased father's blood group, and it has got me worried. I am AB, my brother is O, my sister is O, and so was my father. My mother was type AB, I think. So the burning question is, Is my father really my father?

I replied:

As you may know, you have two copies of every gene in each cell of your body, and you get one from your mother and one from your father.

For example (and to oversimplify a lot), there are two forms of the gene for eye color, one for brown and one for blue. If you have both genes for blue, you will have blue eyes; if you have one or two genes for brown, you will have brown eyes. I will write B for brown and b for blue. So blue-eyed people have bb genes, whereas brown-eyed people can have BB, Bb, or bB genes. (The gene from the father comes first, so Bb means you got brown from your father and blue from your mother.)

Two blue-eyed parents can only have blue-eyed children, whereas two brown-eyed parents can have blue-eyed children if both of them are of the Bb or bB types, and both happen to give their child the b gene. (You probably know some exceptions: I am one, because my father had blue eyes and my mother had brown ones, whereas my own eyes are blue. But looking closely shows that there are flecks of brown in my eye color; blue here means 100% true blue.)

Moving on to the ABO blood type system. There are three kinds of genes here, A, B, and O. The A gene will cause a person to have red blood cells with the A protein in them, and the B gene will cause a person to have red blood cells with the B protein in them. The O gene doesn't do either one. So if someone's genes are AA or AO or OA, they will have A protein and be of blood type A. Someone whose genes are BB or BO or OB will have B protein and be of blood type B. Someone whose genes are AB or BA will have both proteins and be of blood type AB. And finally, genes that are OO will have neither protein and be of type O.

In your case, your father's, brother's, and sister's genes are OO. Your mother is AB or BA and so are you. Your mother gave you either an A or a B gene, and you had to get the other B (or A, as the case may be) from somewhere. Your father is OO, so where did the other gene come from?

But that's not the whole story. Your brother and sister are OO, and your mother could not give them an O gene (since she has only an A gene and a B gene), so where did their O genes come from? One possibility is that you're wrong in thinking your mother was AB.

The most probable explanations are adoption, sperm donation, or something else that makes you and your siblings have different genetic parents. A DNA test of you and your siblings, preferably both of them, will nail this very reliably, and I would encourage you to get one. It turns out that about 15% of human beings, on average, are mistaken about their genetic fathers.

There is another possibility. There is another gene known as the H gene, which comes in two varieties: H (working) and h (not working). (The whole issue of ABO and H versus h does not make any difference to health, of course.) Neither the A nor the B protein can be made in your body unless you have at least one H gene. So people who have hh in their genes always appear to have blood type O, because no A or B protein is being made in their bodies even though the A or the B gene might be physically present. So your father might actually have an A or B gene to give you even though his apparent blood type was O, if he also had hh. However, the h gene is quite rare and the hh combination even rarer, so this isn't a very likely explanation.

Finally, she added:

I would be grateful for any help you can give me. I will always love him either way; I just need to know.

I replied:

Of course. As an adoptive parent whose daughter has always known that she is adopted, I know that genetics has very little to do with how we feel about our children or how they feel about us.


This post has obviously struck a nerve: it has gotten more comment than anything I've ever written. If you are going to comment to ask a question, three things, please:

  1. The Rh blood types (+ and -) are separate from the ABO blood types. The only thing to say about them is that two - parents will always have - children; every other combination is possible.
  2. Look in the following A/B/O chart first. Find your mother's blood type across the top, your father's along the side, and your possible blood types in the box.

Father \ Mother    A            B            AB           O
A                  A or O       Any          Any but O    A or O
B                  Any          B or O       Any but O    B or O
AB                 Any but O    Any but O    Any but O    A or B
O                  A or O       B or O       A or B       O
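
If you would rather work it out than look it up, the simple three-gene model described in my reply (A, B, and O, ignoring the rare H gene) is easy to enumerate. The following is a sketch of that model only, not the code behind the blood type calculator linked above; all the names in it are made up for the example.

```python
# Sketch of the simple ABO model: each parent passes on one of their two genes.
# This ignores the H/h gene and other rare exceptions; names are illustrative.
from itertools import product

# Gene pairs that can lie behind each observed blood type.
GENOTYPES = {
    "A": [("A", "A"), ("A", "O")],
    "B": [("B", "B"), ("B", "O")],
    "AB": [("A", "B")],
    "O": [("O", "O")],
}


def blood_type(gene1, gene2):
    """Blood type produced by a pair of genes (O contributes no protein)."""
    proteins = {gene1, gene2} - {"O"}
    return "".join(sorted(proteins)) if proteins else "O"


def possible_child_types(mother, father):
    """All blood types a child of these two parents could have."""
    types = set()
    for m_genes, f_genes in product(GENOTYPES[mother], GENOTYPES[father]):
        for m_gene, f_gene in product(m_genes, f_genes):
            types.add(blood_type(m_gene, f_gene))
    return types


print(sorted(possible_child_types("AB", "O")))
```

For the family in the letter (mother AB, father O), this model allows only A or B children, which is why both the AB daughter and the O siblings need some other explanation.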

Half a breath

The standard instructions for having a chest X-ray taken are to take a full breath and hold it, then press yourself against the X-ray plate and stand still. When I first got one taken, both the first and second attempts were unusable. An expert (there are geniuses in every field, as Richard Feynman says) was called in, who glanced at the film and told me "Only take half a breath." Apparently my lungs when fully inflated are larger than the standard chest X-ray plate!

It's spelled "colonel", but it's pronounced "kernel"

But why?

Well, because it's French.

But the French word colonel, amazingly for a French word, is pronounced exactly as it's spelled, with no r sound whatever.

The story turns out to be that the Italian word colonello, from Latin columnellus, the leader of a (military) column, got borrowed into French twice. The first time, it became coronel in French, possibly on the notion that it was from Latin corona 'crown' rather than columna.

The form coronel spread to English and Spanish before being replaced in French itself by a second borrowing from Italian, this time more correctly as colonel. The spelling, but not the pronunciation, of this second form then entered English, leaving us with l in the spelling and r in the pronunciation.

Go figure.

Yes, there is that joke

A German professor of philosophy once wrote a three-volume work on das Komische. For the rest of his days, whenever anyone said something funny, he would nod his head soberly and say "Ja, es gibt den Witz."

A diamond-shaped poem on XML schema languages

First, here's the poem:

Complex, inflexible,
Boggling, wearying, proliferating,
Circumscribed, inadequate. Future-proofed, compartmentalized,
Systematizing, pleasing, sufficing,
Straightforward, simplified --

The constraints on this kind of poem are that there are seven lines: lines 1 and 7 contain one noun, lines 2 and 6 contain two adjectives, lines 3 and 5 contain three present participles, and line 4 contains four past participles. I cheated slightly: inadequate is derived from a Latin past participle.

I decided to add a syllabic constraint as well: the syllable lengths in each line are 3, 6, 10, 15, 10, 6, 3, which climb through the triangular numbers (the binomial coefficients C(n, 2)) to 15 and back down again.


On not using more security than you need

This post does not represent the views of my employer or anyone else but me.

Using secure transmission channels involves two main issues: content security and end-to-end authentication. Consider the case of someone who provides news to a variety of paying customers. Does it make sense to use something stronger than just ordinary FTP or HTTP with usernames and passwords sent in the clear? Maybe not. Communications security isn't free, after all; it costs the provider for certificates, bandwidth, and computer cycles, and it costs the client likewise, plus the extra programming effort if the news is to be automatically processed.

The information in news is essentially all public knowledge: its value resides in its timeliness and reliability. Typical customers for news pay by an annual contract, not per download, so they have nothing to lose if the content is stolen by some third party. Does the news provider? Probably not, unless the theft happens on a truly massive scale. Occasional freeloaders are simply no big deal.

As for reverse authentication (is the customer getting its news from the real news provider?), a successful DNS spoof would be far more effectively employed against some e-business site that actually passes around credit card details or the like. News just isn't in that category.

Finally, why would anyone pay to get news in the first place that is available on many websites for free? For one thing, news changes, by definition; this tends to improve the stickiness of sites that display it; when users return to the site, they will find that things have changed at least somewhat. For another, the client may believe that the news provider's credibility (a non-credible news provider doesn't last long) will rub off on him.

Powerpoint, headlines, and captions

Lots of people lately have been denouncing Powerpoint presentations, a term I am here using generically for slides full of bullet-point text. If you want to make or view such presentations, use OpenOffice.org, and if you have numbers to show, use proper graphs instead of textual representations.

But I want to talk about a different point. When you look at a Powerpoint, you find that quite frequently you can't understand it without the talk that went with it; it's just a collection of phrases with no indication of what they might refer to. But I've been told by several people that the four technical presentations on my website are quite intelligible from the slides alone even without my actual talk.

I have a theory about why this might be so. The bullet points on my slides are headlines, whereas the points in this wonderful parody of Lincoln's Gettysburg Address (via Language Log) are simply captions. The difference is that a headline is a complete sentence, possibly with a few omitted words, whereas a caption is just a noun phrase.

Newspapers have been around for a long time, but headlines are just over a century old: the Hearst papers pretty much invented them as part of hyping the Spanish-American War. Before that, in the Civil War, for example, war news was typically headed "The War" or some equally nondescript caption. It wasn't until early in the 20th century that the principle "The headline tells the story" was fully adopted.

Headlinese, at least English-language headlinese, isn't quite grammatically equivalent to ordinary English. Articles are often left out, and so is the copula "is": headlines have to fit into a confined space across the column. It's always straightforward, though, to supply the missing words in order to reconstruct the original. When it isn't, you get the broken headlines that appear on the lower left corner of the Columbia Journalism Review home page.

One entirely modern newspaper headline did appear as long ago as 1781, however, announcing the outcome of the battle of Yorktown, the last major battle of the American Revolution. Two words told the story: CORNWALLIS TAKEN!


The not-so-British Commonwealth

What are the Commonwealth countries?

Antigua and Barbuda, Australia, Bahamas, Barbados, Belize, Botswana, Canada, Grenada, Jamaica, New Zealand, Papua New Guinea, Solomon Islands, St Kitts and Nevis, St Lucia, St Vincent and the Grenadines, Tuvalu, U.K. (17 countries) are constitutional monarchies ruled by Queen Elizabeth II. Except for the U.K., the Queen delegates her powers to a Governor or a Governor-General.

Malaysia, Brunei, Lesotho, Samoa, Swaziland, Tonga (6 countries) are monarchies with other monarchs as heads of state. The Malaysian monarch is chosen by his peer monarchs in the Malaysian Federation from among themselves; the others are straight hereditary monarchies.

Bangladesh, Dominica, Fiji, India, Malta, Mauritius, Pakistan, Singapore, Trinidad and Tobago, Vanuatu (10 countries) are parliamentary republics.

Cameroon, Cyprus, Gambia, Ghana, Guyana, Kenya, Kiribati, Malawi, Maldives, Mozambique, Namibia, Nauru, Nigeria, Seychelles, Sierra Leone, South Africa, Sri Lanka, Tanzania, Uganda, Zambia, Zimbabwe (21 countries) are presidential republics. Mozambique is unusual in being a former Portuguese rather than a former British colony (though some have said it is the colony of a British colony...)

Note that Ireland, a presidential republic, is not part of the Commonwealth.

Horhorn, quickening, and wombfruit

Someone asked me the meaning of the following signature line that I use:

Deshil Holles eamus. Deshil Holles eamus. Deshil Holles eamus.
Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x)
Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa!

saying: "It reads like the gibbering of a schizophrenic. Is it anything but?"

It is, indeed, anything but. The text is from the "Oxen of the Sun" chapter of James Joyce's novel Ulysses, and consists of remarks that can be heard in and around Holles Street maternity hospital in Dublin. Deshil is Irish for "street", eamus is Latin for "let's go"; the speakers are medical students, who know enough Latin and Irish to fool around in both languages, even simultaneously.

The "Send us" line is the prayer of the expectant mothers. I do not know exactly what "Horhorn" means, but "quickening" is pregnancy, and "wombfruit" is a baby, with reference to the familiar prayer to Mary: "... and to the fruit of thy womb, Jesus." Indeed, the German version "Schick uns, du Heller, du Lichter, Horhorn, Leben und Leibesfrucht" leaves "Horhorn" untranslated. (3x) is just my way of saving space; the original repeats this line, like the others, three times.

As for the last line, it represents what someone (the midwife or obstetrician) is saying at the delivery of a male infant, perhaps a first child (they tend to linger a long time). The German version is straightforward: "Hopsa, ein Jungeinjung."

Dreein the weird

Scots has two verbs corresponding to English "endure, put up with":

thole: to put up with something because one has no choice
dree: to put up with something as a choice

Vocabularists may be interested in this contrast. I found it at a page of Scots prescriptivism written in Scots.

The phrase "dree one's weird", therefore, means not merely to endure one's fate, but to choose to endure one's fate.

All the character codes in the world

This is not a proposal to change standards in any respect. It's just a thought-out (well, somewhat) approach for people who have to represent character codes as opposed to characters, and have 32 bits to play with.

The intent is to represent all the codes of all the registered character sets, present and future, as individual unsigned 31-bit integers. All further numbers in this post, except 94, 96, and 2022, are base 16.

Unicode codes are mapped onto the integers 0-10FFFF in the obvious way. The registered character sets of ISO 2022 are represented by codes from 20000000 upward.

The detailed roadmap is as follows:

  • 00000000-0010FFFF: Unicode
  • 00110000-1FFFFFFF: reserved
  • 20000000-2003FFFF: ISO 2022 94-char, 96-char, C0, and C1 character sets
  • 20040000-2093FFFF: ISO 2022 94x94/96x96-char character sets
  • 20940000-5693FFFF: ISO 2022 94x94x94/96x96x96-char character sets
  • 56940000-7FFFFFFF: reserved

Definitions for ISO 2022 character sets:

  • Every character set has an ISO-specified value between 40 and 7E, called F.
  • Some character sets have an ISO-specified value between 21 and 2F, called I. If I is not present, it is deemed for our purposes to be 20.
  • Individual characters in one-byte character sets have a value between 20 and 7F, called H.
  • Individual characters in two-byte character sets have two values between 20 and 7F, called H and L.
  • Individual characters in three-byte character sets have three values between 20 and 7F, called H, M, and L.


  • The value of a character in Unicode is its scalar value.
  • The value of a character in a 94-char character set is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H.
  • The value of a character in a 96-char character set is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H + 80.
  • The value of a character in a 94x94-char or 96x96-char character set is 20040000 + (I - 20) * 90000 + (F - 40) * 2400 + (H - 20) * 60 + (L - 20).
  • The value of a character in a 94x94x94-char or 96x96x96-char character set is 20940000 + (I - 20) * 3600000 + (F - 40) * D8000 + (H - 20) * 2400 + (M - 20) * 60 + (L - 20).
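
The formulas above can be sketched in Python roughly as follows. This is only an illustration: the function names and the is_96 flag are hypothetical, and the three-byte function subtracts 20 from L by analogy with the two-byte formula, which is an assumption. The hex constants are the ones given in the roadmap.

```python
# Hypothetical helpers sketching the mapping; all names are illustrative.
BASE_1 = 0x20000000  # 94-char, 96-char, C0, and C1 sets
BASE_2 = 0x20040000  # 94x94 / 96x96 sets
BASE_3 = 0x20940000  # 94x94x94 / 96x96x96 sets


def _check(name, value, lo, hi):
    """Reject values outside the ranges given in the definitions."""
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value:#x} outside {lo:#x}-{hi:#x}")


def _check_set(I, F):
    _check("I", I, 0x20, 0x2F)
    _check("F", F, 0x40, 0x7E)


def code_1byte(F, H, I=0x20, is_96=False):
    """Code for a character in a one-byte set (is_96 selects the +80 form)."""
    _check_set(I, F)
    _check("H", H, 0x20, 0x7F)
    offset = 0x80 if is_96 else 0
    return BASE_1 + (I - 0x20) * 0x4000 + (F - 0x40) * 0x100 + H + offset


def code_2byte(F, H, L, I=0x20):
    """Code for a character in a two-byte set."""
    _check_set(I, F)
    _check("H", H, 0x20, 0x7F)
    _check("L", L, 0x20, 0x7F)
    return (BASE_2 + (I - 0x20) * 0x90000 + (F - 0x40) * 0x2400
            + (H - 0x20) * 0x60 + (L - 0x20))


def code_3byte(F, H, M, L, I=0x20):
    """Code for a character in a three-byte set (assumes L - 20, see above)."""
    _check_set(I, F)
    for name, value in (("H", H), ("M", M), ("L", L)):
        _check(name, value, 0x20, 0x7F)
    return (BASE_3 + (I - 0x20) * 0x3600000 + (F - 0x40) * 0xD8000
            + (H - 0x20) * 0x2400 + (M - 0x20) * 0x60 + (L - 0x20))
```

For instance, code_1byte(F=0x42, H=0x41) gives 20000241 in hex, and the largest inputs in each family stay below the next boundary in the roadmap.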

This scheme was inspired by a related scheme by Markus Kuhn.


My nose is shinier

I quoted the following exchange of dialogue on a mailing list a few years ago:

"I know more big words than you do!"
"But I can spell them better."
"My hair is wavier."
"My nose is shinier."
"But listen freak, do you have your very own Galactic technology flies-through-the-air SURFBOARD, hmmmmmm?"
     --Mr. Fantastic and Dr. Doom

Not unnaturally, another participant wanted to know what Norrin Radd was doing in the conversation all of a sudden. I explained that at this particular point in the Marvel Universe, Dr. Doom had seized control of the Power Cosmic (including the surfboard) from the Silver Surfer. The tale's retold here and here; the latter page displays marvelous cover art showing Victor Von Doom Silver-Surfer-ized.

But as you might well suppose, this particular conversation is not from any straight Fantastic Four issue. Rather, it comes from Marvel's short-lived late-60s self-parody, Not Brand Ecch, which featured (among other things) the adventures of Charlie America, the Inedible Bulk, and the Mighty Sore*, as well as Forbush-Man, the superhero identity of Marvel janitor Irving Forbush, frequently referred to in the letters column.

* Punchline of a joke: "You're Thor? I won't be able to thit down for a week!"

Unicode is big enough

People tend to be skeptical that the 17 * 65536 = 1,114,112 character codes provided by Unicode will be big enough. After all, we have moved from 8-bit to 64-bit computers, both in word size and in address size; in general, most finite limits have been repeatedly shown to be insufficient. The maximum normal memory on MS-DOS-based PCs was 640K, ten times as big as the 64K limit on the 8-bit systems that preceded them: after all, as Bill Gates supposedly said back in 1981, 640K of memory ought to be enough for anybody!

In fact, though, there just aren't any huge and complicated writing systems hiding in some remote ravine. We have a pretty good map of all the writing systems on the planet; a few may have been overlooked by accident, but none of them are going to be huge. The biggest remaining ones are Egyptian hieroglyphics and ancient Chinese characters, and neither of them will require anything like a million character codes.

There are other ceilings in computing that aren't likely to be broken through either. Consider the number of different assembly-language opcodes. Does anyone foresee computer chips with 65,536 different opcodes? How about 4,294,967,296 distinct opcodes? I don't think so.

Or consider IP version 6 network addresses. There are 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 of them. They won't be assigned densely, according to current plans, but they could be, and that would be enough IP addresses to have a few billion addresses for every soil bacterium in every square centimeter of soil on the planet. Does anybody really believe we are going to "break through" that?
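
The two big numbers in this post are easy to check, since Python integers have arbitrary precision. The variable names below are just illustrative.

```python
# 17 planes of 65,536 code points each, i.e. the scalar range 0-10FFFF.
unicode_codes = 17 * 65536
# IPv6 addresses are 128 bits wide.
ipv6_addresses = 2 ** 128

print(f"{unicode_codes:,}")   # 1,114,112
print(f"{ipv6_addresses:,}")  # 340,282,366,920,938,463,463,374,607,431,768,211,456
```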

Celebes Kalossi: Who knows best?

This post expands and enlarges on my previous post OOP Without Inheritance.

Celebes Kalossi (CK) is the name of a model of object-oriented programming I'm developing. It's also the name of a (currently hypothetical) programming language that implements it. Here I'm going to talk about how objects are constructed in CK. There will be later posts that discuss various other features. I'll say right off that CK is class-based (loosely speaking) rather than delegation-based like Javascript or Self.

Mainstream OO programming languages like Smalltalk, Java, C++, and C# natively support a Daughter Knows Best (DKB) model: a method in a subclass overrides the correspondingly named method in a superclass. All except Smalltalk also require that the static types of the arguments match in order for an override to be successful. I say Daughter Knows Best because the overridden method need not be invoked at all unless the overriding method decides to do so.

Simula and Beta, per contra, use a Father Knows Best (FKB) model: the superclass method is invoked first and foremost. In fact, subclass methods are not invoked at all unless the superclass method decides to do so. This is perfectly symmetrical with the "Daughter Knows Best" model.

There are advantages and disadvantages to both models. In DKB, the subclass can take complete control, which is sensible because it understands its own needs better than the superclass. However, when a superclass method invokes a subclass method under the guise of invoking its own method (because the self/this object belongs to the subclass), it has no guarantees that the subclass method will Do The Right Thing. In FKB, the subclass is stuck with the behavior that the superclass imposes, for good and bad: good, because the superclass can prevent the subclass from going off the rails; bad, because the superclass may do things the subclass does not want.

In CK, all classes are equal, and specify exactly which behavior is appropriate for them. This is achieved by partitioning the notion of class into two notions: type and behavior. To create an object, you specify its type: in statically typed Celebes Kalossi, variables specify their types. The methods you can invoke on a type are the methods it declares public, plus the methods declared public by all its supertypes. A type can have more than one supertype. Consequently, a type is like an interface in Java or C#, except that types can have instances, whereas interfaces can't.

A behavior is simply a named set of methods plus instance variables. Objects cannot be created nor can variables be typed using the name of a behavior. The instance variables are always private to the behavior alone, so if you want to make them accessible to the outside world, you must provide getter and/or setter methods for them within the behavior. The methods, however, can be specified as private, external, or standard (no keyword in the syntax). Private methods, like instance variables, are defined by the behavior and visible only within the behavior. External methods can only be declared, not defined, and indicate which methods this behavior depends on but does not itself define. Finally, standard methods are defined by the behavior and are available in types that use the behavior.

What do I mean by "use the behavior"? In the declaration of a type, the programmer can specify that it has no behaviors, in which case it is an abstract type, or the programmer can specify one or more behaviors. The behaviors must fit together like jigsaw puzzle pieces: if a method is declared as external in one or more behaviors of the type, it must be defined with a standard definition in exactly one behavior of the type. Furthermore, each of the public methods of the type (including the public methods of its supertypes) must correspond to a standard method defined in a single behavior. However, behaviors are never inherited from supertypes: a type that wishes to have the same behavior as a supertype must specify that behavior explicitly.
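
Since CK is hypothetical, there is no real syntax to show, but the jigsaw-puzzle rule can be modeled in Python as a rough sketch. The Behavior class and the jigsaw_errors function are invented here purely for illustration; private methods and instance variables are left out because they play no part in the fitting rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """A named set of standard (defined) and external (required) methods."""
    name: str
    standard: frozenset = frozenset()
    external: frozenset = frozenset()


def jigsaw_errors(public_methods, behaviors):
    """Return rule violations for a type; an empty list means the pieces fit."""
    if not behaviors:
        return []  # a type with no behaviors is abstract, so nothing to check
    # How many behaviors give each method a standard definition?
    definitions = Counter(m for b in behaviors for m in b.standard)
    # Every public method and every external method needs exactly one definition.
    needed = set(public_methods)
    for b in behaviors:
        needed |= b.external
    return [
        f"{method}: defined by {definitions[method]} behaviors, expected exactly 1"
        for method in sorted(needed)
        if definitions[method] != 1
    ]


# Hypothetical example: a stack type assembled from two behaviors.
storage = Behavior("ListStorage", standard=frozenset({"push", "pop", "length"}))
reporting = Behavior("Reporting", standard=frozenset({"describe"}),
                     external=frozenset({"length"}))
public = {"push", "pop", "describe"}
```

With both behaviors, jigsaw_errors(public, [storage, reporting]) returns an empty list; with reporting alone, it reports length, pop, and push as undefined.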

The model also provides control of the visibility of individual method names. When a type declares that it uses a behavior B, it may mention standard methods to be suppressed or renamed. A suppressed method is not visible to the jigsaw-puzzle mechanism described above; this allows creating types whose behaviors have method names that accidentally clash. Renaming a method allows it to be invoked from other behaviors under a different name. This allows the DKB model to be simulated: the subtype uses its supertype's behavior, suppressing conflicting methods that are not going to be called, and renaming ones that are. Thus there is no mechanism in CK corresponding to super in DKB languages or inner in FKB languages.

There is some syntactic sugar that makes the model easier to use. In particular, a behavior A can declare that it uses another behavior B. This simply means that whenever a type uses behavior A, it also automatically uses behavior B; no special relationship between A and B is necessarily implied. Furthermore, types can define instance variables, private methods, and standard methods as well as declaring external methods. In terms of the model, these things are really declared in an anonymous behavior which the type automatically uses.

Behaviors are something between mixins and traits. Traits don't have instance variables, which in the traits model are defined within classes that incorporate the traits. Mixins have instance variables that are visible to all other mixins within the class, bringing in all the problems of unrestricted multiple inheritance. Behaviors share the orthogonality of traits: you can just combine them without worrying about the order in which they are to be combined, since identical instance variable names are irrelevant (instance variables being private), and identical method names are forbidden unless renamed or suppressed. Since each behavior carries its own state with it, and behaviors are plug-replaceable, one can create closely related types, or even equal types (loops in the subtype-supertype relationship are not forbidden) that implement variant behaviors.


HTML attributes that are URIs

This is a list of HTML 4.01 attributes of type URI and what they are about.

a/@href           hypertext reference
applet/@codebase  local base URI
area/@href        hypertext reference
base/@href        global base
blockquote/@cite  source of quotation
body/@background  background image
del/@cite         explanation of change
form/@action      server-side form handler
frame/@longdesc   long description
frame/@src        source of frame content
head/@profile     named dictionary of metainfo
iframe/@longdesc  long description
iframe/@src       source of frame content
img/@longdesc     long description
img/@src          image reference
img/@usemap       client-side image map
input/@src        image reference
input/@usemap     client-side image map
ins/@cite         explanation of change
link/@href        hypertext reference
object/@classid   identifies an implementation
object/@codebase  local base URI
object/@data      object's data
object/@usemap    client-side image map
q/@cite           source of quotation
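
One use for such a list is pulling every URI out of a page. Here is a minimal sketch using Python's standard html.parser module; URI_ATTRIBUTES, URICollector, and uri_attributes are hypothetical names, and the table is simply the list above, with img used for the longdesc row since HTML 4.01 has no image element.

```python
from html.parser import HTMLParser

# (element, attribute) pairs taken from the list above.
URI_ATTRIBUTES = {
    ("a", "href"), ("applet", "codebase"), ("area", "href"), ("base", "href"),
    ("blockquote", "cite"), ("body", "background"), ("del", "cite"),
    ("form", "action"), ("frame", "longdesc"), ("frame", "src"),
    ("head", "profile"), ("iframe", "longdesc"), ("iframe", "src"),
    ("img", "longdesc"), ("img", "src"), ("img", "usemap"),
    ("input", "src"), ("input", "usemap"), ("ins", "cite"), ("link", "href"),
    ("object", "classid"), ("object", "codebase"), ("object", "data"),
    ("object", "usemap"), ("q", "cite"),
}


class URICollector(HTMLParser):
    """Collects (element, attribute, value) for every URI-typed attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        for name, value in attrs:
            if (tag, name) in URI_ATTRIBUTES and value is not None:
                self.found.append((tag, name, value))


def uri_attributes(html_text):
    """Return the URI attributes found in html_text, in document order."""
    collector = URICollector()
    collector.feed(html_text)
    collector.close()
    return collector.found
```

Calling uri_attributes on a fragment with an a element and an img element returns one tuple for the href and one for the src, and ignores attributes such as alt.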

Knowing knowledge

The following four rules explain what it is to know something. X knows the proposition p if and only if:

  1. X believes p;
  2. p is true;
  3. if p weren't true, X wouldn't believe it;
  4. if p were true, X would believe it.

Why do we need the fourth condition? To eliminate what are called Gettier counterexamples. This one is due to Quine:

Four men set sail from Boston on 8 November 1918 with the justified false belief that the War in Europe was over (because reports to that effect had been circulated in the newspapers). They arrived in Bermuda four days later with no further information, but now their belief was true. However, it did not count as knowledge, because the justification and the truth were entirely independent of each other.

Doing it yourself

Once there was an internetworking protocol named DECnet, which like IP used 32-bit addressing. When the time came to support Ethernet and similar LANs in DECnet Phase IV, the mapping of DECnet node addresses to 48-bit Ethernet MAC (hardware interface) addresses was solved by changing the MAC address in software to be the same as the DECnet address! (DEC did set the "locally administered" bit in all such MAC addresses.)

When the programmer responsible was located and asked "Why didn't you use ARP?", the IP protocol for dynamically mapping IP addresses to MAC addresses, he simply replied "What's ARP?"



I just found out from a Time Magazine interview that J. K. Rowling pronounces her villain's name with a silent final t. My middle name is pronounced German fashion, with the initial letter like an English v.

Sincerely yours,

John Woldemar Cowan


GPLed software is not Debian-free

Really. The GNU GPL violates the Debian Tentacles of Evil test, because it is a bare license, and (with proper notice) a licensor can revoke a bare license at any time. At most the people already relying on the license may be able to use the legally tricky doctrine of promissory estoppel to go on relying on it. Everyone else is SOL.


Monochrome and Gentium

As you can see, I've switched my blog to monochrome. I've also switched to using the Gentium font if you have it. If you don't, do yourself a favor and download it (a ZIP archive; for the Mac and Linux RPM formats, see the download page).

Unicode and fonts

This piece is too big for a blog entry, so it's on my web site.


18th century chitchat at Samuel Johnson's table

SIR A [Alexander Macdonald]: "I think, Sir, almost all great lawyers, such at least as have written upon law, have known only law, and nothing else."

JOHNSON: "Why no, Sir; Judge Hale was a great lawyer, and wrote upon law; and yet he knew a great many other things; and has written upon other things. Selden too."

SIR A: "Very true, Sir; and Lord Bacon. But was not Lord Coke a mere lawyer?"

JOHNSON: "Why, I am afraid he was; but he would have taken it very ill if you had told him so. He would have prosecuted you for scandal."

BOSWELL: "Lord Mansfield is not a mere lawyer."

JOHNSON: "No, Sir. I never was in Lord Mansfield's company; but Lord Mansfield was distinguished at the University. Lord Mansfield, when he first came to town, 'drank champagne with the wits,' as Prior says. He was the friend of Pope."

SIR A: "Barristers, I believe, are not so abusive now as they were formerly. I fancy they had less law long ago, and so were obliged to take to abuse, to fill up the time. Now they have such a number of precedents, they have no occasion for abuse."

JOHNSON: "Nay, Sir, they had more law long ago than they have now. As to precedents, to be sure they will increase in course of time; but the more precedents there are, the less occasion is there for law; that is to say, the less occasion is there for investigating principles."

SIR A: "I have been correcting several Scotch accents in my friend Boswell. I doubt, Sir, if any Scotchman ever attains to a perfect English pronunciation."

JOHNSON: "Why, Sir, few of them do, because they do not persevere after acquiring a certain degree of it. But, Sir, there can be no doubt that they may attain to a perfect English pronunciation, if they will. We find how near they come to it; and certainly, a man who conquers nineteen parts of the Scottish accent, may conquer the twentieth.

"But, Sir, when a man has got the better of nine tenths he grows weary, he relaxes his diligence, he finds he has corrected his accent so far as not to be disagreeable, and he no longer desires his friends to tell him when he is wrong; nor does he choose to be told. Sir, when people watch me narrowly, and I do not watch myself, they will find me out to be of a particular county [Staffordshire]. In the same manner, Dunning may be found out to be a Devonshire man. So most Scotchmen may be found out.

"But, Sir, little aberrations are of no disadvantage. I never catched Mallet in a Scotch accent; and yet Mallet, I suppose, was past five-and-twenty before he came to London."

     --Boswell's Life of Johnson for 1772

Before I hear about it for "Scotch" and "Scotchman", I will point out that Boswell, a Boswell of Auchinleck and most certainly a Scot of the Scots, uses these forms himself.

Wireless hackers ca. 1903

This article illustrates neatly that our predecessors of a century ago were much like us. I was especially struck by the vadding needed to lay down an amateur (wired) telegraph line, the "my transmitter has more kW than yours" battle, and the "I can't remember his name, but his callsign is ...".

Two paternosters in Scots

A modern version:

Faither o' us aa, bidin abune,
thy name be halie.
Let thy reign begin.
Thy will be dune,
on the erthe, as it is in Hevin.
Gie us ilka day oor needfu fendin
an forgie us aa oor ill-deeds,
e'en as we forgie thae wha dae us ill
as lat us no be testit,
but sauf us frae the Ill-Ane,
for the croon is thine ain,
an the micht,
an the glorie,
for iver an iver.

A more traditional version:

Oor faither in heiven
hallowt be thy name.
Thy kingdom come,
Thy will be dune,
on the yird as in heiven.
Gie us oor breid for this incomin day.
Forgie us the wrangs we hae wrocht,
as we hae forgien the wrangs we hae dree'd.
An say us na sairlie.
But sauf us frae the ill-ane.
And thine be the kingdom,
the pooer, an the glorie,
noo an forivver. Amen.

Note in each case the semantics of the Seventh Petition: not "deliver us from evil", but specifically "deliver us from the Evil One."

Old Irish satire

Although Westerners have pretty well given up believing that the name is the thing, or even that there's some kind of resonance between names and things, it hasn't always been so. My ancestors were about as western as you can get in Europe, and they used names to induce malefactors to commit suicide: "sticks and stones may break my bones, but names can really mess me up."

Here are a few examples of Old Irish satire, that most feared of all arts:

A satire on Bres, "the first satire made in Ireland", from the story of the Second Battle of Mag Tuireadh.

Cen cholt for crib cernini,
Without food quickly on a dish,
cen gert ferbba fora n-asa aithrinni,
without cow's milk on which a calf thrives,
cen adba fir iar ndruba disoirchi,
without a man's dwelling after the staying of darkness,
cen dil daimi rissi,
without a storyteller band's payment,
ro sain Brisse.
so be Bres.

A "poem to raise blisters" from Cormac's Glossary:

Maile baire gaire Caier
Evil, death, short life to Caier
combeodutar celtra cath Caier
May battle spears slay Caier
Caier diba Caier dira Caier foro
Caier by land, Caier by earth, Caier rejected
fomara fochara Caier.
Under mound, under rocks, Caier.

And a single line from a much longer satire; each word means "I will satirize" using three different roots:

gromfa gromfa glamfa glamfa aerfa aerfa

I suspect that these curses probably did "work" in their context: an intensely shame-oriented (as opposed to guilt-oriented) culture in which people identified extremely strongly with their public reputations, to the point where the destruction of that reputation could send them into overwhelming depression. (God knows it isn't so today: the Irish have acquired an overwhelmingly guilt-oriented culture somewhere along the way.)

I often wonder if Japan might have been able to escape the spiral of events leading to World War II if they had institutionalized public satire as an escape valve.

Go away, you silly samurai, or I will satirize you ... a second time!


Problems with quantitative reasoning

I'm not a linkblogger, really I'm not, but this ham tale gave me something I badly needed lately: a really thoroughgoing laugh attack, the kind where you check yourself at the end to see if you need to change your clothes.

Of course, the error in quantitative reasoning was in supposing that a Quarter-Pounder contains 1/4 lb of hamburger after it's cooked: it contains maybe a third of that.


Regularized Inglish

I know this looks like a whole mess of misspellings, but it's actually a very sensible spelling reform (not revolution) devised by Axel Wijk and published in his book Regularized Inglish back in 1959, and simplified slightly by me. This is the American version.

Wunce upon a time thare livd a poor boy named Dick Whittington, hooze faather and muther were bothe ded. Having neether home nor frends, he roamed about the cuntry trying to ern hiz living. Sumtimes he cood not finde any wurk, and he offen had to go hungry.

On market days he herd the farmers tauk about the greit city of Lundon. They sed that its streets wer paved with gold. So Dick made up hiz minde to go to Lundon and seek hiz fortune. Packing hiz clothes into a bundle and cauling hiz faithful cat he started out. After days and days ov wauking, the hungry lad finally reached Lundon.

But alas, the streets were not paved with gold but hard cobblestones. He wondered [i.e. wandered] about the city seeking for wurk. At last, he came to the house ov a rich merchant and knocked at the dor.

The dor woz opened by the cook, but when she oenly saw a ragged boy on the step, she woz angry and told him to begon. At that moment the oener ov the house, Master Fitzwarren, returned and seeing the poor boy's condition he took pity on him and ordered the cook to giv him sum foode. "If yoo wish to wurk", he added, "yoo may stay here and help cook in the kitchen. Yoo will finde a bed in the attic." Dick thanked Master Fitzwarren very much for hiz greit kindeness.

Dick might hav been happy had it not been for the cook, hoo whipped him aulmoste every day. She treated him so badly that the merchant's daughter, hoo woz a kinde-harted girl, felt very sorry for the lonely lad.

Wun day the merchant cauld aul hiz servants together. He told them that he had a ship reddy to sail to forren lands, and that each ov them might send sumthing in her, and they shood hav aul that it sold for.

"What ar yoo going to send, Dick?" asked the merchant's daughter.

"I hav nuthing to send," sed Dick sadly, "nuthing but my cat."

"Fetch thy cat then, boy, and send her!" sed the merchant.

Dick woz sorry to part with Poossy, yet he obeyed hiz master, and with tears in hiz ies gave Pooss to the capten ov the ship.

Aultho Dick wurked hard and tried to pleaze the cook, she continued to beat and torment him. At last he cood not stand it eny longer and made up hiz minde to run away. Wun morning he got up very erly, and packing hiz few things into a tiny bundle, he slipped out ov the house. When he got az far az a place cauld Highgate, he felt tired and sat down thare to rest. Suddenly the bells ov Boe Church began ringing and az he lissened it seemed to him that they wer saying:

Ding-dong, ding-dong,
Turn agen, Whittington,
Thrice Lord Mayor ov Lundon!

"Lord Mayor ov Lundon", he said to himself. "Hoo wood not be Lord Mayor ov Lundon? But if I run away I'll never hav a chance. I'll go back again and endure aul the cook's beatings raather than miss such a chance." Back he hurried and managed to get into the house before the cook had cum down.

While aul this woz happening, the ship with Dick's cat woz bloen by a storm to a distant cuntry inhabited by Moors. Theze peeple receeved the capten and hiz men kindely and wer anxious to see whot the straingers had in their ship. The capten shoed [i.e. showed] them hiz goods and aulso sent sum samples to the king ov the cuntry.

The King woz so well pleazed with the samples that he invited the capten to hav dinner with the King and Queen. Az soon az the dishes wer braught in and poot down on the table, an immense number ov rats rushed out from every side and sworming over the foode, ate it nearly aul up. The capten woz amazed at this and asked the King how he cood stand such a thing.

"But whot can I doo to stop them?" sed the King. "I wood gladly giv haf my kingdom to get rid ov theze pests."

Then the capten thaught of Dick's cat and told the King that he had a little animal on hiz ship that wood make short wurk ov theze creatures.

"Go bring this wunderful animal to me," cried the King, "and I will load yoor ship with gold and jewels in exchainge for her."

The capten hurried off while anuther dinner woz being prepared, and when he returned with Pooss, the rats wer bizzy eating that aulso. Down amung them he put Pooss and she flew around killing a greit number, while the rest ran away.

The King and Queen wer overjoyed to see their enemies thus dispersed, and when the capten sed that he wood be onnord if they wood allow him to make them a prezent of Pooss, the King woz so delighted that he baught aul the ship's cargo and gave ten times az much for the cat.

The ship then sailed back with fair winds to Ingland. On arriving home the capten went to the merchant and shoed [i.e. showed] him aul the trezures that the King had given for Pooss. The onnest merchant at wunce sent for Dick and congratulated him on having becum a rich man. "Yoor cat haz braught yoo more munney than I pozess," he sed. "May yoo liv long to enjoy it."

Dick fell on hiz knees and thanked Heven for hiz good fortune. He then reworded [i.e. rewarded] the capten and the crew and aulso gave prezents to aul the servants, even to the cross-tempered cook.

Later on Dick married hiz master's daughter and the yung cupple lived long and happily. The prophecy that the bells ov Boe Church had chimed in the ears ov the ragged boy later came true. Three times woz Dick Whittington Lord Mayor ov Lundon.

Question and answer

On a mailing list I was administering, a user enquired:

All quiet: no whines,
no rants, no eye-glazing screeds;
Was I unsubscribed?

I replied:

No, not at all, Sir.
Our view of you is unchanged.
Fix your spam filters.


Most of this material comes from Steven Pinker's book Words and Rules.

There are about 165 inherited irregular verb roots in English (for example, see, saw, seen), and maybe 35 irregular noun roots (for example, foot, feet). This does not count the Latin and Greek plurals, which we typically learn in school and don't acquire with the rest of the language.

In English, the regular nouns and verbs are the most common kind, but this isn't true in some other languages: in closely related German, for example, the overwhelming majority of nouns are irregular (the regular ending -s, although applicable to all sorts of nouns, is quite rare), and there are far more irregular verbs than in English.

Similarly, the noun classifier system in Chinese and other languages operates quite analogously to irregular noun plurals in other languages; there is a regular classifier ge, and then there are lots of fuzzily defined families of nouns, each with its specific classifier. These families tend to be organized on semantic lines, but with lots of exceptions.

For example, the Chinese classifier for human being has a respectful tone, and the word for thief doesn't normally take it, using the regular classifier instead. It's a defect in most Chinese dictionaries that they don't list the most usual classifier for a noun, in the way that French dictionaries show gender.

In Japanese, which has the same kind of classifier system, the classifier for book remains the one for vertical cylinders, despite the prevalence of codices over scrolls for some generations now. In Burmese, where nouns can almost always be used with more than one classifier, a semantic explanation tells us why basket of cows is a forbidden combination, but does not explain why a team of horses is also forbidden: one must refer to the fact that Burmese do not happen to use teams of horses.

It makes a great deal of difference whether a word is regular or irregular: compounds formed from them obey quite different rules. A compound or idiom whose head has an irregular root is irregular: overate, undid, bogeymen, stepchildren, milk teeth, straw men, oil mice, beewolves (a kind of wasp), cut a deal, bought the farm, caught cold, went bananas, threw up. Note that He threw up the ball (not an idiom) and He threw up his lunch (idiomatic) are syntactically indistinguishable; in either case, the up can be postposed.

However, a bahuvrihi compound is regular: tenderfoots, sabertooths, lowlifes, flatfoots, still lifes. Walkmans is also headless, though not technically bahuvrihi (if it were, it would mean "one who walks like a man" or something similar).

Rootless nouns and verbs made from names, quotations, sounds, abbreviations and foreign words are regular: I've been Rolling Stoned and Beatled till I'm blind, There are five "man"s on that page, The tire made several pffffts.

Denominal verbs where the verb is derived from an irregular noun are nevertheless regular: stringed means 'having had a string removed', despite the verb string, strung underlying the noun string, and to be put out (in baseball) by reason of hitting a fly ball which is caught is to be flied out, not flew out, because of the intervening noun fly 'fly ball'. Likewise, The doctor slided the sample means he put it on a microscope slide.

Regularly inflected nouns can't normally be incorporated into compounds: mice-eater is acceptable, rats-eater is unacceptable. Exceptions arise when the plural form refers to a heterogeneous collection of individuals treated as distinct entities: a dog hater needn't hate any particular dog (he need not even have met a dog), but an enemies list is a list of specific, individual enemies. Likewise with Landmarks Commission, singles bar, the Morphemes Project. Pinker collects these and puts them on his (what else?) exceptions list.


Loyal opposition

In a post on WS-* last year, Tim Bray used the phrase "Her Majesty's Loyal Opposition", saying:

The idea is, they Oppose the Government but are Loyal in that they promise not to lead a mob with pitchforks to string them up; and they stand ready to provide an alternative.

Historically, the Opposition was the Loyal opposition not because it was against overthrowing the monarchy, but because in George III's day there were the "King's friends" and then there were the others, who most certainly did not want to be thought of as the "King's enemies". The term "Loyal Opposition" served to deflect the antagonism of the Crown to those who did not support its policies, whatever they happened to be. Later, when the Crown became a captive of whoever won the elections, the sense of the term shifted a bit.

But hang in there loyally opposing. Those who take pleasure in common sense and sound design support you.


Tim Bray wrote:

There seem to be two kinds of XML projects: those where they send some emails and examples back and forth and are now in production, and those where they strike a task force to assemble the schemas, and the project is still in committee stage.

I replied that there is a third road, the Fascist-Publisher approach:

All our news articles conform to one of these two DTDs. One is a subset of XHTML with some additional elements, the other is IPTC NewsML. Take it or leave it.

The only problem with this approach is the following too-often-repeated dialogue:

Me: We can provide you with news in HTML.

Hapless Customer: Don't you have XML?

Me: Our HTML news is well-formed XML. [It would be valid, but there is no DOCTYPE declaration, to prevent hammering on the DTD on our web server.]

HC: But we want XML, not HTML!

Me: As I say, our HTML news is well-formed XML. [Noch einmal]

Orgon son of Ubu

The original French:

Ondoyons un poupon, dit Orgon, fils d'Ubu. Bouffons choux, bijoux, poux, puis du mou, du confit ; buvons, non point un grog : un punch. Il but du vin itout, du rhum, du whisky, du coco, puis il dormit sur un roc. L'infini bruit du ru couvrit son son. Nous irons sous un pont où nous pourrons promouvoir un dodo, dodo du poupon du fils d'Orgon fils d'Ubu.

Un condor prit son vol. Un lion riquiqui sortit pour voir un dingo. Un loup fuit. Un opossum court. Où vont-ils? L'ours rompit son cou. Il souffrit. Un lis croît sur un mur : voici qu'il couvrit orillons ou goulots du cruchon ou du pot pur stuc.

Ubu pond son poids d'or.

In English translation:

"I'm going to rock this child in his cot," sighs Orgon, son of Ubu. "I'm going to wolf down mutton, broccoli, dumplings, rich plum pudding. I'm going to drink, not grog, but punch." Orgon drinks hock, too, rum, Scotch, plus two hot brimming mugs of Bovril to finish up with, which soon prompts him to nod off. Running brooks drown out his snoring. I stroll to rocks on which I too will nod off, with Orgon's dozing son, with Orgon, son of Ubu.

Condors swoop down on us. Poor scrofulous lions slink out, scrutinizing dingos with scornful looks. Chipmunks run wild. Opossums run, too, without stopping. North or south? I wouldn't know. Plunging off clifftops, bison splits limb in two. It hurts. Ivy grows on brick, rising up from stucco pots to shroud windows or roofs.

From Ubu's bottom drops his own bulk in gold.


Hello! I am an XML encoding sniffer

I am an algorithm that sniffs at byte streams that purport to be XML documents to figure out what character set is used to encode them. I start by checking the first four bytes of the stream to assign a tentative encoding. If I see:

  • 0xEF 0xBB 0xBF, I assign "UTF-8";
  • 0xFF 0xFE or 0xFE 0xFF followed by anything but 0x00 0x00, I assign "UTF-16";
  • 0x4C 0x6F 0xA7 0x94, I assign "EBCDIC-unknown";
  • Otherwise, I assign "unknown".

If the tentative encoding is "UTF-8", I return it. Otherwise I then read forward, ignoring all 0x00 bytes, until I find either a 'g' (0x67, or 0x87 on the EBCDIC code path) or a '>' (0x3E, or 0x6E on the EBCDIC code path).

In the former case, I sniff further for an apostrophe (0x27, or 0x7D on the EBCDIC code path) or a double quotation mark (0x22, or 0x7F on the EBCDIC code path). I then collect the encoding name following it until the next apostrophe or quotation mark, always ignoring 0x00 bytes, and return it. (On the EBCDIC code path, I need to translate it from invariant EBCDIC to ASCII first.)

In the latter case, there is no encoding declaration, and I return "UTF-16" if the tentative encoding was "UTF-16" or "UTF-8" otherwise. (On the EBCDIC code path, this is an error.)

Then someone else starts over from the beginning of the byte stream, decoding and parsing. I may return erroneous results if the document is not well-formed XML, but in that case there will certainly be errors detected by the parser.
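For the curious, here is a minimal Python sketch of me, following the description above step by step. The function name sniff_encoding, the ValueError exceptions, and the use of Python's cp037 codec to translate invariant EBCDIC are assumptions made for illustration, and '>' is taken to be 0x3E in ASCII and 0x6E in EBCDIC. It is a sketch, not a reference implementation.

```python
def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of a byte stream that purports to be XML."""
    # Step 1: assign a tentative encoding from the first four bytes.
    if data[:3] == b"\xef\xbb\xbf":
        return "UTF-8"  # a UTF-8 byte order mark settles it at once
    if data[:2] in (b"\xff\xfe", b"\xfe\xff") and data[2:4] != b"\x00\x00":
        tentative = "UTF-16"
    elif data[:4] == b"\x4c\x6f\xa7\x94":  # "<?xm" in EBCDIC
        tentative = "EBCDIC-unknown"
    else:
        tentative = "unknown"

    ebcdic = tentative == "EBCDIC-unknown"
    letter_g, greater_than = (0x87, 0x6E) if ebcdic else (0x67, 0x3E)
    quotes = (0x7D, 0x7F) if ebcdic else (0x27, 0x22)

    # Step 2: read forward, ignoring every 0x00 byte.
    stream = (b for b in data if b != 0x00)
    for b in stream:
        if b == greater_than:
            # No encoding declaration before the end of the first tag.
            if ebcdic:
                raise ValueError("EBCDIC document without an encoding declaration")
            return "UTF-16" if tentative == "UTF-16" else "UTF-8"
        if b == letter_g:
            break  # presumably the final 'g' of "encoding"
    else:
        raise ValueError("ran out of bytes before finding 'g' or '>'")

    # Step 3: skip ahead to the opening apostrophe or quotation mark.
    for b in stream:
        if b in quotes:
            break
    else:
        raise ValueError("no quoted encoding name found")

    # Step 4: collect the encoding name up to the closing quote.
    name = bytearray()
    for b in stream:
        if b in quotes:
            break
        name.append(b)
    else:
        raise ValueError("unterminated encoding name")

    # Assumption: cp037 stands in for "invariant EBCDIC" here.
    return bytes(name).decode("cp037" if ebcdic else "ascii")
```

For example, sniff_encoding(b"<?xml version='1.0' encoding='ISO-8859-1'?><a/>") returns "ISO-8859-1", and a stream with no encoding declaration and no byte order mark comes back as "UTF-8".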

Extreme 2005

I'll be presenting two tutorials at Extreme Markup 2005: one on Unicode, and one on RESTful Web Services. Please come to Extreme and sign up for these, because if there aren't enough people, they'll be canceled and I won't get there and that would be baaaaaaaad.


Almost everything I post here is coming from my outgoing mail archive, so if you use Google you can often find the original context. I do make improvements (corrections, clarifications, and so on), and these are the small subset of my postings I consider worth reading outside that context. I'm working through the archive from 1999 forward, so eventually I'll get to the fairly recent stuff.

Hey, I said it's recycled knowledge, but I didn't say how many times it had been recycled. One of the reasons I dropped off a lot of mailing lists at the end of 2004 was that I found I was repeating the same postings I had made years ago.



Anything that is actually implemented and deployed is by definition legacy. Googling for "legacy Java code" (exact phrase) today produced 444 hits, "legacy Java" by itself almost 30,000, and even "legacy XML" 619.

As usual, Frederick Brooks called it correctly thirty years ago in The Mythical Man-Month, chapter 1:

[T]he product over which one has labored so long appears to be obsolete upon (or before) completion. Already colleagues and competitors are in hot pursuit of new and better ideas. Already the displacement of one's thought-child is not only conceived, but scheduled. [...] As soon as one freezes a design, it becomes obsolete in terms of its concepts.

Of course, he also says, with understatement that is less than usual in our profession:

The new and better product is generally not available when one completes his own: it is only talked about [...]. The real tiger is never a match for the paper one, unless actual use is wanted. Then the virtues of reality have a satisfaction all their own.

"th" as in "then"

English th represents two different sounds, the sound of th in thin and the sound of th in then. The two sounds are quite rare in the world's languages: only Greek and Icelandic among the European languages have both. Although they are as different as f and v or s and z, there has never been any way of distinguishing them in English orthography (in the IPA they are written θ and ð respectively). This isn't as bad as it seems, because in fact the second sound is used only in a minority of words, which basically fall into four categories:

  1. in native English words between vowels
  2. in native English words at the end where there used to be a vowel that has been lost, often with a silent -e that represents it
  3. at the beginning in closed-class words
  4. just before a final m with no vowel separating them

Examples of the first group are: bother, brethren (formerly bretheren), brother, either, farther, father, fathom, feather, further, gather, hither, leather, mother, neither, nether, other, rather, slither, smithereens, smithy, smother, swarthy, together, weather, (bell)wether, whether, (where)withal, wither, withershins (a variant of widdershins, a rare word meaning 'counter-clockwise'), worthy, and their inflected and derived forms. Either appears on this list, but ether does not, because it is borrowed from Greek. Thither belongs to both this group and the third group.

Examples of the second group are: bathe, bequeath, betroth, blithe, breathe, clothe, lathe, lithe, loathe, scythe, seethe, smooth, soothe, teethe, tithe, withe, wreathe, writhe and their inflected and derived forms. The verb mouth (not the noun mouth) also belongs to this category.

Examples of the third group are: than, that, the, their(s), them, then, thence(forth), there (and compounds), these, they, this, those, (al)though, thus, thy(self). All these words belong to closed classes, grammatically similar groups of words that aren't easily added to English, such as pronouns, conjunctions, and articles. Open classes like nouns, verbs, and adjectives don't normally have this sound at the beginning of the word: thigh is a noun and has the first th sound, whereas thy is a pronoun and has the second.

Examples of the fourth group (the only ones I can find) are algorithm, logarithm, and rhythm.

There are very few minimal pairs for these two th sounds; that is, pairs of words which are distinguished only because one has the first and the other the second sound. Thigh and thy, mentioned above, form perhaps the best-known pair. For some English-speakers, thin and then are likewise a minimal pair, because they do not distinguish between short e and short i before m or n.

Earlier varieties of English (and perhaps some dialects still) had no phonemic voicing contrasts in fricatives. Since then, all fricative phonemes save h have split (f and v, s and z, θ and ð, and ʃ and ʒ) but to varying degrees, and even now there are not very many minimal pairs for any of the four.


Borders, lots of borders

Which country is bordered by the greatest number of other countries?

It's a close race. Russia borders on Norway, Finland, Estonia, Latvia, Lithuania (via its Kaliningrad exclave), Poland (via Kaliningrad), Belarus, Ukraine, Georgia, Azerbaijan, Kazakhstan, China, and Mongolia, and is separated by only a few kilometers of water from Japan (in the Kuriles) and the U.S. (the Diomedes in the Bering Strait).

Update: Russia also borders on North Korea, a border so short I missed it before, and (amazingly enough) on Switzerland: there is a monument to the Russian general Alexander Suvorov near Göschenen in central Switzerland that is legally Russian territory.

China borders on North Korea, Russia, Mongolia, Kazakhstan, Kyrgyzstan, Tajikistan, Afghanistan, Pakistan, India, Nepal, Bhutan, Myanmar, Laos, and Vietnam. If you call Taiwan a country, the Taiwanese island of Jīnmén (better known as Quemoy) is also only a few kilometers from China.

Why the Jews?

Hannah Arendt's book Eichmann in Jerusalem contains a very readable nation-by-nation survey of what went on in Europe during the Holocaust, and (to the extent possible) why. Denmark, e.g., was the only nation to make a determined politically based resistance: the Danish Jews were "Danish citizens of the Jewish faith", and as for the refugee Jews in Denmark at the time, Germany had declared them stateless and so could not expect to get them back.

Everywhere else, statelessness was damaging to Jews, but in Denmark, where the government had decided to protect the Jews, it paradoxically became an advantage. The story of the yellow star (if the Germans imposed it on the Jews, the King of Denmark would be the first to wear one) is probably apocryphal, but it sufficiently indicates the attitude of Danish citizens of all classes of society.

It turned out that the one and only time that the Nazis met resistance of this kind, they collapsed under it absolutely: it is very possible that the Danes were warned of the coming German roundup of the Jews (which found only about a hundred too old, too sick, or too isolated to hide) by a member of the German embassy in Copenhagen.

In Italy and Bulgaria, it was more a matter of humanitarianism. Although they were officially allies of Germany, all the preliminary moves required for the Final Solution were mis-executed, bollixed up, ignored, or (in one case) flatly refused "as inconsistent with the honor of the Italian Army".

In Arendt's view, Western political Antisemitism in general was a response to the general conditions after World War I, where the rich Jewish banking families were perceived as having prolonged the war to their own profit by lending money to both sides. In one of those switches all too common in history, a group earlier praised by cultural leaders for its cosmopolitan and pan-European attitude was now condemned for being insufficiently nationalist. (This is distinct from the social Antisemitism, centuries old, which was at the time rapidly declining in Western Europe.)

In the East the problem was different: the general settlement after World War I had to take into account the collapse of the Austro-Hungarian and Ottoman empires. The vast belt of mixed populations was then carved up in such a way that some groups (Serbs, e.g.) got their own countries and others (Croats, e.g.) did not, with consequences visible a century later. The Jews, although found everywhere, were nowhere in the majority, so it was impossible to create a Jewish autonomous region in any of the new countries. Therefore, the Jews (who were recognized as a distinct people by everyone, as opposed to the Western situation where "it was considered a sign of Antisemitism to call a Jew a Jew") ended up consistently on the bottom of the pecking order, blamed for everything that went wrong.

But read the book. This potted summary of a tiny section does not do it justice.