2005-08-31

On not using more security than you need

This post does not represent the views of my employer or anyone else but me.

Using secure transmission channels involves two main issues: content security and end-to-end authentication. Consider the case of someone who provides news to a variety of paying customers. Does it make sense to use something stronger than just ordinary FTP or HTTP with usernames and passwords sent in the clear? Maybe not. Communications security isn't free, after all; it costs the provider for certificates, bandwidth, and computer cycles, and it costs the client likewise, plus the extra programming effort if the news is to be automatically processed.

The information in news is essentially all public knowledge: its value resides in its timeliness and reliability. Typical customers for news pay by an annual contract, not per download, so they have nothing to lose if the content is stolen by some third party. Does the news provider? Probably not, unless the theft happens on a truly massive scale. Occasional freeloaders are simply no big deal.

As for reverse authentication (is the customer getting its news from the real news provider?), a successful DNS spoof would be far more effectively employed against some e-business site that actually passes around credit card details or the like. News just isn't in that category.

Finally, why would anyone pay for news in the first place when it is available on many websites for free? For one thing, news changes by definition, which tends to improve the stickiness of sites that display it: when users return to the site, they find that things have changed at least somewhat. For another, the client may believe that the news provider's credibility (a non-credible news provider doesn't last long) will rub off on him.

Powerpoint, headlines, and captions

Lots of people lately have been denouncing Powerpoint presentations, a term I am here using generically for slides full of bullet-point text. If you want to make or view such presentations, use OpenOffice.org, and if you have numbers to show, use proper graphs instead of textual representations.

But I want to talk about a different point. When you look at a Powerpoint, you find that quite frequently you can't understand it without the talk that went with it; it's just a collection of phrases with no indication of what they might refer to. But I've been told by several people that the four technical presentations on my website are quite intelligible from the slides alone even without my actual talk.

I have a theory about why this might be so. The bullet points on my slides are headlines, whereas the points in this wonderful parody of Lincoln's Gettysburg Address (via Language Log) are simply captions. The difference is that a headline is a complete sentence, possibly with a few omitted words, whereas a caption is just a noun phrase.

Newspapers have been around for a long time, but headlines are just over a century old: the Hearst papers pretty much invented them as part of hyping the Spanish-American War. Before that, in the Civil War, for example, war news was typically headed "The War" or some equally nondescript caption. It wasn't until early in the 20th century that the principle "The headline tells the story" was fully adopted.

Headlinese, at least English-language headlinese, isn't quite grammatically equivalent to ordinary English. Articles are often left out, and so is the copula "is": headlines have to fit into a confined space across the column. It's always straightforward, though, to supply the missing words in order to reconstruct the original. When it isn't, you get the broken headlines that appear on the lower left corner of the Columbia Journalism Review home page.

One entirely modern newspaper headline did appear as long ago as 1781, however, announcing the outcome of the battle of Yorktown, the last major battle of the American Revolution. Two words told the story: CORNWALLIS TAKEN!

2005-08-29

The not-so-British Commonwealth

What are the Commonwealth countries?

Antigua and Barbuda, Australia, Bahamas, Barbados, Belize, Botswana, Canada, Grenada, Jamaica, New Zealand, Papua New Guinea, Solomon Islands, St Kitts and Nevis, St Lucia, St Vincent and the Grenadines, Tuvalu, U.K. (17 countries) are constitutional monarchies ruled by Queen Elizabeth II. Everywhere except in the U.K., the Queen delegates her powers to a Governor or a Governor-General.

Malaysia, Brunei, Lesotho, Samoa, Swaziland, Tonga (6 countries) are monarchies with other monarchs as heads of state. The Malaysian monarch is chosen by his peer monarchs in the Malaysian Federation from among themselves; the others are straight hereditary monarchies.

Bangladesh, Dominica, Fiji, India, Malta, Mauritius, Pakistan, Singapore, Trinidad and Tobago, Vanuatu (10 countries) are parliamentary republics.

Cameroon, Cyprus, Gambia, Ghana, Guyana, Kenya, Kiribati, Malawi, Maldives, Mozambique, Namibia, Nauru, Nigeria, Seychelles, Sierra Leone, South Africa, Sri Lanka, Tanzania, Uganda, Zambia, Zimbabwe (21 countries) are presidential republics. Mozambique is unusual in being a former Portuguese rather than a former British colony (though some have said it is the colony of a British colony...)

Note that Ireland, a presidential republic, is not part of the Commonwealth.

Horhorn, quickening, and wombfruit

Someone asked me the meaning of the following signature line that I use:

Deshil Holles eamus. Deshil Holles eamus. Deshil Holles eamus.
Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x)
Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa!

saying: "It reads like the gibbering of a schizophrenic. Is it anything but?"

It is, indeed, anything but. The text is from the "Oxen of the Sun" chapter of James Joyce's novel Ulysses, and consists of remarks that can be heard in and around Holles Street maternity hospital in Dublin. Deshil is Irish for "street", eamus is Latin for "let's go"; the speakers are medical students, who know enough Latin and Irish to fool around in both languages, even simultaneously.

The "Send us" line is the prayer of the expectant mothers. I do not know exactly what "Horhorn" means, but "quickening" is pregnancy, and "wombfruit" is a baby, with reference to the familiar prayer to Mary: "... and to the fruit of thy womb, Jesus." Indeed, the German version "Schick uns, du Heller, du Lichter, Horhorn, Leben und Leibesfrucht" leaves "Horhorn" untranslated. (3x) is just my way of saving space; the original repeats this line, like the others, three times.

As for the last line, it represents what someone (the midwife or obstetrician) is saying at the delivery of a male infant, perhaps a first child (they tend to linger a long time). The German version is straightforward: "Hopsa, ein Jungeinjung."

Dreein the weird

Scots has two verbs corresponding to English "endure, put up with":

thole
to put up with something because one has no choice
dree
to put up with something as a choice

Vocabularists may be interested in this contrast. I found it at a page of Scots prescriptivism written in Scots.

The phrase "dree one's weird", therefore, means not merely to endure one's fate, but to choose to endure one's fate.

All the character codes in the world

This is not a proposal to change standards in any respect. It's just a thought-out (well, somewhat) approach for people who have to represent character codes as opposed to characters, and have 32 bits to play with.

The intent is to represent all the codes of all the registered character sets, present and future, as individual unsigned 31-bit integers. All further numbers in this post, except 94, 96, and 2022, are base 16.

Unicode codes are mapped onto the integers 0-10FFFF in the obvious way. The registered character sets of ISO 2022 are represented by codes from 20000000 upward.

The detailed roadmap is as follows:

  • 00000000-0010FFFF: Unicode
  • 00110000-1FFFFFFF: reserved
  • 20000000-2003FFFF: ISO 2022 94-char, 96-char, C0, and C1 character sets
  • 20040000-2093FFFF: ISO 2022 94x94/96x96-char character sets
  • 20940000-5693FFFF: ISO 2022 94x94x94/96x96x96-char character sets
  • 56940000-7FFFFFFF: reserved

Definitions for ISO 2022 character sets:

  • Every character set has an ISO-specified value between 40 and 7E, called F.
  • Some character sets have an ISO-specified value between 21 and 2F, called I. If I is not present, it is deemed for our purposes to be 20.
  • Individual characters in one-byte character sets have a value between 20 and 7F, called H.
  • Individual characters in two-byte character sets have two values between 20 and 7F, called H and L.
  • Individual characters in three-byte character sets have three values between 20 and 7F, called H, M, and L.

Values:

  • The value of a character in Unicode is its scalar value.
  • The value of a character in a 94-char character set is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H.
  • The value of a character in a 96-char character set is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H + 80.
  • The value of a character in a 94x94-char or 96x96-char character set is 20040000 + (I - 20) * 90000 + (F - 40) * 2400 + (H - 20) * 60 + (L - 20).
  • The value of a character in a 94x94x94-char or 96x96x96-char character set is 20940000 + (I - 20) * 3600000 + (F - 40) * D8000 + (H - 20) * 2400 + (M - 20) * 60 + (L - 20).
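
The value rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and constant names are made up for this example, the two-byte and three-byte cases both offset L by 20 (which is what keeps every result inside the roadmap ranges), and nothing here checks that a given F and I actually name a registered set.

```python
# Sketch of the value rules above. All names are illustrative assumptions.
ONE_BYTE_BASE = 0x20000000
TWO_BYTE_BASE = 0x20040000
THREE_BYTE_BASE = 0x20940000


def _check(name, value, low, high):
    """Reject a component that falls outside its allowed range."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value:#x} is outside {low:#x}-{high:#x}")


def _check_set(f, i):
    _check("F", f, 0x40, 0x7E)
    _check("I", i, 0x20, 0x2F)  # 0x20 stands for "no I value present"


def unicode_code(scalar):
    """Unicode maps to itself."""
    _check("scalar", scalar, 0, 0x10FFFF)
    return scalar


def one_byte_code(f, h, i=0x20, is_96=False):
    """94-char sets use H as is; 96-char sets add 80 (hex)."""
    _check_set(f, i)
    _check("H", h, 0x20, 0x7F)
    code = ONE_BYTE_BASE + (i - 0x20) * 0x4000 + (f - 0x40) * 0x100 + h
    return code + 0x80 if is_96 else code


def two_byte_code(f, h, l, i=0x20):
    _check_set(f, i)
    _check("H", h, 0x20, 0x7F)
    _check("L", l, 0x20, 0x7F)
    return (TWO_BYTE_BASE + (i - 0x20) * 0x90000 + (f - 0x40) * 0x2400
            + (h - 0x20) * 0x60 + (l - 0x20))


def three_byte_code(f, h, m, l, i=0x20):
    _check_set(f, i)
    for name, value in (("H", h), ("M", m), ("L", l)):
        _check(name, value, 0x20, 0x7F)
    return (THREE_BYTE_BASE + (i - 0x20) * 0x3600000 + (f - 0x40) * 0xD8000
            + (h - 0x20) * 0x2400 + (m - 0x20) * 0x60 + (l - 0x20))


if __name__ == "__main__":
    # A 94-char set with F = 42 and no I, character H = 41.
    print(f"{one_byte_code(0x42, 0x41):08X}")
```

The multipliers fall out of the slot sizes: each byte position has 60 (hex) possible values, so a two-byte set needs 60 * 60 = 2400 slots and a three-byte set 2400 * 60 = D8000; there are 40 possible values of F and 10 of I.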

This scheme was inspired by a related scheme by Markus Kuhn.

2005-08-22

My nose is shinier

I quoted the following exchange of dialogue on a mailing list a few years ago:

"I know more big words than you do!"
"But I can spell them better."
"My hair is wavier."
"My nose is shinier."
"But listen freak, do you have your very own Galactic technology flies-through-the-air SURFBOARD, hmmmmmm?"
     --Mr. Fantastic and Dr. Doom

Not unnaturally, another participant wanted to know what Norrin Radd was doing in the conversation all of a sudden. I explained that at this particular point in the Marvel Universe, Dr. Doom had seized control of the Power Cosmic (including the surfboard) from the Silver Surfer. The tale's retold here and here; the latter page displays marvelous cover art showing Victor Von Doom Silver-Surfer-ized.

But as you might well suppose, this particular conversation is not from any straight Fantastic Four issue. Rather, it comes from Marvel's short-lived late-60s self-parody, Not Brand Echh, which featured (among other things) the adventures of Charlie America, the Inedible Bulk, and the Mighty Sore*, as well as Forbush-Man, the superhero identity of Marvel janitor Irving Forbush, frequently referred to in the letters column.

* Punchline of a joke: "You're Thor? I won't be able to thit down for a week!"

Unicode is big enough

People tend to be skeptical that the 17 * 65536 = 1,114,112 character codes provided by Unicode will be big enough. After all, we have moved from 8-bit to 64-bit computers, both in word size and in address size; in general, most finite limits have been repeatedly shown to be insufficient. The maximum normal memory on MS-DOS-based PCs was 640K, ten times as big as the 64K limit on the 8-bit systems that preceded them: after all, as Bill Gates supposedly said back in 1981, 640K of memory ought to be enough for anybody!

In fact, though, there just aren't any huge and complicated writing systems hiding in some remote ravine. We have a pretty good map of all the writing systems on the planet; a few may have been overlooked by accident, but none of them are going to be huge. The biggest remaining ones are Egyptian hieroglyphics and ancient Chinese characters, and neither of them will require anything like a million character codes.

There are other ceilings in computing that aren't likely to be broken through either. Consider the number of different assembly-language op codes. Does anyone foresee computer chips with 65,536 different opcodes? How about 4,294,967,296 distinct opcodes? I don't think so.

Or consider IP version 6 network addresses. There are 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 of them. They won't be assigned densely, according to current plans, but they could be, and that would be enough IP addresses to have a few billion addresses for every soil bacterium in every square centimeter of soil on the planet. Does anybody really believe we are going to "break through" that?
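
The counts quoted in this post are easy to check, since Python integers have arbitrary precision. The variable names below are just labels for this sketch.

```python
# Re-derive the counts quoted in this post.
unicode_codes = 17 * 65536      # 17 planes of 65,536 codes each
two_byte_opcodes = 2 ** 16      # every possible 16-bit opcode
four_byte_opcodes = 2 ** 32     # every possible 32-bit opcode
ipv6_addresses = 2 ** 128       # every possible 128-bit address

for label, count in [
    ("Unicode codes", unicode_codes),
    ("16-bit opcodes", two_byte_opcodes),
    ("32-bit opcodes", four_byte_opcodes),
    ("IPv6 addresses", ipv6_addresses),
]:
    print(f"{label}: {count:,}")
```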

Celebes Kalossi: Who knows best?

This post expands on my previous post OOP Without Inheritance.

Celebes Kalossi (CK) is the name of a model of object-oriented programming I'm developing. It's also the name of a (currently hypothetical) programming language that implements it. Here I'm going to talk about how objects are constructed in CK. There will be later posts that discuss various other features. I'll say right off that CK is class-based (loosely speaking) rather than delegation-based like JavaScript or Self.

Mainstream OO programming languages like Smalltalk, Java, C++, and C# natively support a Daughter Knows Best (DKB) model: a method in a subclass overrides the correspondingly named method in a superclass. All except Smalltalk also require that the static types of the arguments match in order for an override to be successful. I say Daughter Knows Best because the overridden method need not be invoked at all unless the overriding method decides to do so.

Simula and Beta, per contra, use a Father Knows Best (FKB) model: the superclass method is invoked first and foremost. In fact, subclass methods are not invoked at all unless the superclass method decides to do so. This is perfectly symmetrical with the "Daughter Knows Best" model.

There are advantages and disadvantages to both models. In DKB, the subclass can take complete control, which is sensible because it understands its own needs better than the superclass. However, when a superclass method invokes a subclass method under the guise of invoking its own method (because the self/this object belongs to the subclass), it has no guarantees that the subclass method will Do The Right Thing. In FKB, the subclass is stuck with the behavior that the superclass imposes, for good and bad: good, because the superclass can prevent the subclass from going off the rails; bad, because the superclass may do things the subclass does not want.
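
To make the contrast concrete, here is a small Python sketch. Python is natively DKB, so the FKB half is only a simulation by convention: the superclass method calls a hook (named _inner_describe here, a made-up name standing in for Beta's inner), and nothing in Python actually stops a subclass from overriding describe itself.

```python
class DkbBase:
    def describe(self):
        return "base"


class DkbChild(DkbBase):
    def describe(self):
        # Daughter Knows Best: the override runs instead of the superclass
        # method, and it never has to call super().describe() at all.
        return "child"


class FkbBase:
    def describe(self):
        # Father Knows Best: the superclass method always runs first and
        # decides whether, and where, the subclass contribution appears.
        return "base[" + self._inner_describe() + "]"

    def _inner_describe(self):
        # Default contribution when no subclass supplies one.
        return ""


class FkbChild(FkbBase):
    def _inner_describe(self):
        return "child"


if __name__ == "__main__":
    print(DkbChild().describe())
    print(FkbChild().describe())
```

In the DKB half the superclass text vanishes entirely; in the FKB half the subclass can only fill the slot the superclass leaves open.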

In CK, all classes are equal, and specify exactly which behavior is appropriate for them. This is achieved by partitioning the notion of class into two notions: type and behavior. To create an object, you specify its type: in statically typed Celebes Kalossi, variables specify their types. The methods you can invoke on a type are the methods it declares public, plus the methods declared public by all its supertypes. A type can have more than one supertype. Consequently, a type is like an interface in Java or C#, except that types can have instances, whereas interfaces can't.

A behavior is simply a named set of methods plus instance variables. Objects cannot be created nor can variables be typed using the name of a behavior. The instance variables are always private to the behavior alone, so if you want to make them accessible to the outside world, you must provide getter and/or setter methods for them within the behavior. The methods, however, can be specified as private, external, or standard (no keyword in the syntax). Private methods, like instance variables, are defined by the behavior and visible only within the behavior. External methods can only be declared, not defined, and indicate which methods this behavior depends on but does not itself define. Finally, standard methods are defined by the behavior and are available in types that use the behavior.

What do I mean by "use the behavior"? In the declaration of a type, the programmer can specify that it has no behaviors, in which case it is an abstract type, or the programmer can specify one or more behaviors. The behaviors must fit together like jigsaw puzzle pieces: if a method is declared as external in one or more behaviors of the type, it must be defined with a standard definition in exactly one behavior of the type. Furthermore, each of the public methods of the type (including the public methods of its supertypes) must correspond to a standard method defined in a single behavior. However, behaviors are never inherited from supertypes: a type that wishes to have the same behavior as a supertype must specify that behavior explicitly.
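
Since CK has no implementation yet, the jigsaw-puzzle rule can only be illustrated by a model of it. The Python sketch below is one such model, with every name in it (Behavior, check_type, the sample behaviors) invented for the illustration; it ignores private methods, instance variables, and supertypes, and it treats any method name defined as standard in two behaviors as a clash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """A named set of standard (defined) and external (required) methods."""
    name: str
    standard: frozenset = frozenset()
    external: frozenset = frozenset()


def check_type(public_methods, behaviors):
    """Return a list of problems; an empty list means the pieces fit."""
    if not behaviors:
        return []  # no behaviors at all: an abstract type, nothing to check

    # Record which behaviors supply a standard definition of each method.
    definers = {}
    for behavior in behaviors:
        for method in behavior.standard:
            definers.setdefault(method, []).append(behavior.name)

    problems = []
    for method, names in sorted(definers.items()):
        if len(names) > 1:
            problems.append(
                f"{method} is defined by several behaviors: "
                + ", ".join(sorted(names)))

    # Every public method and every external method needs a definition.
    required = set(public_methods)
    for behavior in behaviors:
        required |= behavior.external
    for method in sorted(required):
        if method not in definers:
            problems.append(f"{method} is not defined by any behavior")
    return problems


counter = Behavior("Counter",
                   standard=frozenset({"increment", "value"}),
                   external=frozenset({"log"}))
logger = Behavior("Logger", standard=frozenset({"log"}))

if __name__ == "__main__":
    print(check_type({"increment", "value"}, [counter, logger]))
    print(check_type({"increment", "value"}, [counter]))
```

The first call reports no problems because Logger supplies the log method that Counter declares external; the second reports that log has no definition.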

The model also provides control of the visibility of individual method names. When a type declares that it uses a behavior B, it may mention standard methods to be suppressed or renamed. A suppressed method is not visible to the jigsaw-puzzle mechanism described above; this allows creating types whose behaviors have method names that accidentally clash. Renaming a method allows it to be invoked from other behaviors under a different name. This allows the DKB model to be simulated: the subtype uses its supertype's behavior, suppressing conflicting methods that are not going to be called, and renaming ones that are. Thus there is no mechanism in CK corresponding to super in DKB languages or inner in FKB languages.

There is some syntactic sugar that makes the model easier to use. In particular, a behavior A can declare that it uses another behavior B. This simply means that whenever a type uses behavior A, it also automatically uses behavior B; no special relationship between A and B is necessarily implied. Furthermore, types can define instance variables, private methods, and standard methods as well as declaring external methods. In terms of the model, these things are really declared in an anonymous behavior which the type automatically uses.

Behaviors are something between mixins and traits. Traits don't have instance variables, which in the traits model are defined within classes that incorporate the traits. Mixins have instance variables that are visible to all other mixins within the class, bringing in all the problems of unrestricted multiple inheritance. Behaviors share the orthogonality of traits: you can just combine them without worrying about the order in which they are to be combined, since identical instance variable names are irrelevant (instance variables being private), and identical method names are forbidden unless renamed or suppressed. Since each behavior carries its own state with it, and behaviors are plug-replaceable, one can create closely related types, or even equal types (loops in the subtype-supertype relationship are not forbidden) that implement variant behaviors.

2005-08-09

HTML attributes that are URIs

This is a list of HTML 4.01 attributes of type URI and what they are about.

Element/@attr       Explanation
a/@href             hypertext reference
applet/@codebase    local base URI
area/@href          hypertext reference
base/@href          global base
blockquote/@cite    source of quotation
body/@background    background image
del/@cite           explanation of change
form/@action        server-side form handler
frame/@longdesc     long description
frame/@src          source of frame content
head/@profile       named dictionary of metainfo
iframe/@longdesc    long description
iframe/@src         source of frame content
img/@longdesc       long description
img/@src            image reference
img/@usemap         client-side image map
input/@src          image reference
input/@usemap       client-side image map
ins/@cite           explanation of change
link/@href          hypertext reference
object/@classid     identifies an implementation
object/@codebase    local base URI
object/@data        object's data
object/@usemap      client-side image map
q/@cite             source of quotation
script/@src         script
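
One use for such a list is pulling every URI out of a page. Here is a minimal sketch using Python's standard html.parser module; the names URI_ATTRS, UriCollector, and uri_attributes are made up for this example, and the dictionary simply transcribes the list above using the HTML 4.01 element names.

```python
from html.parser import HTMLParser

# Element name -> attributes of type URI, following the list above.
URI_ATTRS = {
    "a": {"href"},
    "applet": {"codebase"},
    "area": {"href"},
    "base": {"href"},
    "blockquote": {"cite"},
    "body": {"background"},
    "del": {"cite"},
    "form": {"action"},
    "frame": {"longdesc", "src"},
    "head": {"profile"},
    "iframe": {"longdesc", "src"},
    "img": {"longdesc", "src", "usemap"},
    "input": {"src", "usemap"},
    "ins": {"cite"},
    "link": {"href"},
    "object": {"classid", "codebase", "data", "usemap"},
    "q": {"cite"},
    "script": {"src"},
}


class UriCollector(HTMLParser):
    """Collects (element, attribute, value) for every URI-typed attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        for name, value in attrs:
            if value is not None and name in URI_ATTRS.get(tag, ()):
                self.found.append((tag, name, value))


def uri_attributes(html_text):
    collector = UriCollector()
    collector.feed(html_text)
    collector.close()
    return collector.found


if __name__ == "__main__":
    page = '<a href="x.html">x</a><img src="p.png" alt="p">'
    print(uri_attributes(page))
```

Attributes that are not in the list, such as alt, are skipped, and the values come back exactly as written in the page, without being resolved against any base URI.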

Knowing knowledge

The following four rules explain what it is to know something. X knows the proposition p if and only if:

  1. X believes p;
  2. p is true;
  3. if p weren't true, X wouldn't believe it;
  4. if p were true, X would believe it.

Why do we need the fourth condition? To eliminate what are called Gettier counterexamples. This one is due to Quine:

Four men set sail from Boston on 8 November 1918 with the justified false belief that the War in Europe was over (because reports to that effect had been circulated in the newspapers). They arrived in Bermuda four days later with no further information, but now their belief was true. However, it did not count as knowledge, because the justification and the truth were entirely independent of each other.

Doing it yourself

Once there was an internetworking protocol named DECnet, which used 16-bit node addresses where IP uses 32-bit ones. When the time came to support Ethernet and similar LANs in DECnet Phase IV, the mapping of DECnet node addresses to 48-bit Ethernet MAC (hardware interface) addresses was solved by changing the MAC address in software to be the same as the DECnet address! (DEC did set the "locally administered" bit in all such MAC addresses.)

When the programmer responsible was located and asked "Why didn't you use ARP?", the IP protocol for dynamically mapping IP addresses to MAC addresses, he simply replied "What's ARP?"

2005-08-05

Mwahahahahahah!

I just found out from a Time Magazine interview that J. K. Rowling pronounces her villain's name with a silent final t. My middle name is pronounced German fashion, with the initial letter like an English v.

Sincerely yours,

John Woldemar Cowan