TagSoup 1.1 is now released at http://tagsoup.info. This release includes JAXP classes for those who want them. HTML comments have been completely revamped (the implementation has been broken for many releases), and a few other small bugs fixed. Do upgrade, if only for the comment handling.
2007-03-22
2007-02-24
Regularized Inglish: wordlists
The following words (from a list of the 1000 most frequent words) have the stressed vowel changed:
abuv, afternoone, agen, agenst, aul, aulmoste, aulreddy, aulso, aultho, aulways, amung, anuther, anser, eny, enything, ask, baul, bair (verb), bayr (noun), beutiful, beuty, becoz, becum, beloe, blud, bloe, branch, breik, bruther, braught, bild, bilding, bilt, bisness, bisy, bye (for "buy"), caul, can't, chance, chainge, Shicago, class, cullour/cullor, cum, cumming, command, cumpany, controel, cood, coodn't, cuntry, corse, cort, cuver, dance, dainger, ded, deth, demand, discuver, doo, duz, duzn't, dun, doen't, dor, dubble, erly, erth, yther/eether, Ingland, Inglish, enuff, example, ie (for "eye"), faul, flor, floe (for "flow"), foar (for "four"), frend, frunt, fooll (for "full"), guvernment, grant, grass, greit, groope, groe, haf, haul, hed, helth, herd (for "heard"), hart (for "heart"), heven, hevy, hight, insted, iorn, jurnal, knoe, knoen, knolledge, laf, lern, Lundon, looze, luv, loe, loer, mashien, meny, mesure, munney, munth, muther, moove(ment), nyther/neether, nun (for "none"), nuthing, wunce, wun, oenly, uther, aught (for "ought"), oen, peeple, plesant, plesure, pritty, proove, pooll (for "pull"), poot (for "put"), quorter, red (for "read"), reddy, receev, remoove, roel, sault, sed, ses, shoo, shood (for "should"), shoelder, shoe (for "show"), smaul, snoe, sum(thing,times), sun (for "son"), soel, spred, strainge, tauk, taul, tair, thare, tharefore, tho, thaught, thru, tuch, tord (for "toward"), trubble, too (for "two"), waul, wont (for "want"), wor (for "war"), worm (for "warm"), wos, wosn't, wosh, Woshington, wotch, wauter, wair, wether, whot(ever), whare, hoo, hoel, hoome (for "whom"), hooze, wooman, wimmen, wunder(ful), woen't, wurd, wurk, wurld, wood (for "would"), woodn't, yoo, yung, yoor(self), yoo'r, yoothe.
And here are the words that have an unstressed vowel changed. Most unstressed vowel spellings are left alone in Regularized Inglish, but these are changed to avoid confusion.
capten, certen, certenly, mounten, forren, felloe, folloe, folloeing, narroe, tomorroe, windoe, yelloe.
From the same list, changes involving adding or removing a final e:
afternoone, childe, wilde, behinde, finde, kinde, minde, winde (verb), sine (for "sign"), moste, poaste, bothe, truthe, coole, foode, foole, moone, roofe, roome, scoole, soone, gon, Europ, ar, wer, figur, promis, purpos, minut, hav, giv, liv, nativ, believ, leav, serv, twelv, themselvs.
And finally, words with changes in consonants:
caracter, dout, gard, onnour/onnor, our (for "hour"), iland, lissen, ov, offen, scoole, shugar, shure, studdy, sugest, sunn, winn.
Here are the analogous lists for the next most frequent 1000 words:
accumpany, ahed, aincient, eny(body,wun,way), arrainge, baught, boe, boel, bred, brekfast, brest, breth, braud, bery, boosh, cam (for "calm"), cassle, chaimber, clark/clerk, curnel, cumfort(able), cupple, currage, cuzin, daingerous, discuvery, disese, duzen, ern, everywhare, exchainge, faulen, faulse, flud, foek, faught, foarth, frend(ly,ship), foolly (for "fully"), glance, gloe, guvernor, greitly, groen, groth, hunney, improove, jurney, kee, lether, luvly, luver, mashienery, magazien, ment, utherwise, oe, oener, poliece, poar, prayr, poosh, quolity, quontity, quorrel, rainge, recaul, recuver, ruff, rute, roe, scaircely, serch, seeze, shoen, sloe (for "slow"), sloely, sum(wum,whot,where), saught, sorce, suthern, steddy, strainger, thare'z, thred, thretten, thruout, throe (for "throw"), throen, tung, tresure, unknoen, wonder (for "wander"), worn (for "warn"), welth, woolf, wunn (for "won"), wurker, wurry, wurse, wurship, wurst, wurthy, woonde, yoo'd, yoo'll, yoo'v.
automobiel, curten, equol, welcum, holloe, shaddoe, sorroe.
blinde, kindely, desine, worne, noone, shoote, smoothe, troope, handsom, determin, engin, examin, imagin, purchase, favourit/favorit, immediat(ly), opposit, privat, senat, separat (adj.), havn't, activ, representativ, observ, ourselvs, preserv, reserv.
ashure, clime, Crist(ian,mas), det, onnest, leag, Lincon, sacrifise, summ, shurely, sord, Tomas, whissle.
These lists probably contain minor errors of transcription.
Regularized Inglish: theory
I see I promised to post on Regularized Inglish (RI) back in 2005, but never got around to it. Here's a brief explanation.
Axel Wijk's Regularized Inglish is a massive multi-decade job (completed in the 1950s, so there's nothing available online about it) of analyzing practically every word in the language, figuring out what the complicated rules behind the spelling system really are, identifying all the truly irregular words, and proposing properly rule-governed spellings for them. English, for example, is truly irregular in its first vowel only, and so it becomes Inglish.
The underlying principle of RI is that every spelling shall correspond to at most a few sounds, preferably only one; multiple spellings for a single sound, however, are tolerated. Thus -ough is kept for bough, but not for rough, through, plough, hiccough, hough, or borough. Why choose bough? In order to consistently apply the RI rule that says "gh has no effect on the pronunciation of any word".
In my opinion, Wijk goes a bit far in a few places: for example, he introduces dh for the sound of th in father in other than initial positions (the, not dhe) for very little gain; he sorts out long "a" into a as in fat and "aa" as in father; he changes s to z when pronounced that way, except in the plural of nouns (not nounz) and the third-person singular ending of verbs. I wouldn't bother with any of these changes, which have little impact on being able to pronounce words at sight.
But as reformed (not revolutionized) spellings go, RI is a Great Thing.
2007-02-19
A haiku
Quantum mechanics:
Which moves through the springtime pond?
Frog or a ripple?
2007-02-18
The heart of Celebes Kalossi
I'm finally ready to explain the central ideas of my object-oriented programming model Celebes Kalossi (read the linked post first for the terminology). I've been hanging fire on it for a long time over a terminological issue, which I've decided to just punt on.
In CK, there are two kinds of relationships between classes: subtyping and incorporation. Subtyping is like the relationship between a Java interface and its superinterface; incorporation is something like C++ private inheritance. These two concepts are intertwingled in various ways in various programming languages, but CK completely separates them. Every class subtypes one or more classes except for the root of the class hierarchy (if there is one); classes can incorporate zero or more other classes.
When you declare that class A subtypes class B, you mean that A (the subclass) has all the methods that are declared or defined public in B (the superclass) and all of B's superclasses, plus those declared or defined public in A itself. You are not saying that any of the definitions provided in B or its superclasses are or are not available in A, so there is no problem if a given method is public in more than one superclass. You are also suggesting that an instance of A is Liskov substitutable for an instance of B, although it is impossible to check this property mechanically. We call the public methods of A and its superclasses the interface of A.
When you declare that class C incorporates class D, on the other hand, you mean that the non-private methods defined in class D (the incorporated class) are effectively given the same definition in class C (the incorporating class). Provided that the methods in D have been properly declared in class C, they can be invoked on objects of class C just as if they had been defined as standard or public methods of class C. It does not matter if a method is defined in class D and declared in class C (which is C++ private inheritance) or vice versa. However, the fact that a method is public in class D does not make it public in class C unless it is declared public in class C or one of its superclasses: incorporation affects behavior but not interface.
For convenience, we say that a class subtypes itself and incorporates itself. Both incorporation and subtyping are transitive: class A subtypes class B's superclasses as well as class B itself, and classes incorporated by class D are implicitly incorporated by class C as well. All the methods in the incorporated classes of C are placed on an equal footing: it does not matter how they were incorporated. A loop in the subtype hierarchy specifies that all the types in the loop have the same interface; a loop in the incorporation hierarchy is ignored, so even if C incorporates D, D can also incorporate C. Note that an implementation may provide types which are outside the model: typical examples would be numeric types, strings, and exception objects.
If a method is declared in a class X or in any of the classes incorporated (directly or indirectly) by class X, it must also be defined exactly once in one of those classes. (If there are no definitions of some method, the class is abstract and must be declared as such.) If incorporating class Z would cause conflicts, CK provides for renaming or hiding unwanted methods when incorporating a class: class L can incorporate class M including specified methods (in which case all others are hidden), excluding specified methods, or renaming a method as a new name. Classes that incorporate L see the changed view.
There are no inheritance rules in CK, because there is no inheritance as such: if you want X to use the same implementation of method m as its superclass Y, then incorporate the class Z from which Y gets its implementation of method m. You can do the same thing for classes that are not related to you in the superclass hierarchy, or even for your subclasses if you really want to -- the model is nothing if not flexible.
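To make the separation concrete, here is a minimal sketch of the model as a pair of Python data structures. It is only an illustration, not a CK implementation: the names CKClass, Incorporation, interface, definitions and view are all made up for this example, private methods and the abstract-class check are left out, and I am assuming that reaching the very same definition by two different incorporation paths does not count as a conflict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # eq=False keeps identity hashing, so classes can go in sets
class CKClass:
    name: str
    supertypes: list = field(default_factory=list)    # subtyping: interface only
    incorporates: list = field(default_factory=list)  # Incorporation clauses: behavior only
    public: set = field(default_factory=set)          # method names declared public here
    defined: dict = field(default_factory=dict)       # method name -> function defined here

    def interface(self, seen=None):
        """Public method names of this class and all of its superclasses."""
        seen = set() if seen is None else seen
        if self in seen:  # a loop in the subtype hierarchy: stop here
            return set()
        seen.add(self)
        names = set(self.public)
        for sup in self.supertypes:
            names |= sup.interface(seen)
        return names

    def definitions(self, path=frozenset()):
        """Method definitions available here, both own and incorporated."""
        if self in path:  # a loop in the incorporation hierarchy is ignored
            return {}
        merged = dict(self.defined)
        for clause in self.incorporates:
            for name, impl in clause.view(path | {self}).items():
                # Assumption: the same function reached twice is not a conflict.
                if name in merged and merged[name] is not impl:
                    raise TypeError(f"{self.name}: {name!r} is defined more than once")
                merged[name] = impl
        return merged


@dataclass
class Incorporation:
    """One 'incorporates' clause with optional including/excluding/renaming."""
    target: CKClass
    including: Optional[set] = None  # if given, all other methods are hidden
    excluding: set = field(default_factory=set)
    renaming: dict = field(default_factory=dict)  # old name -> new name

    def view(self, path=frozenset()):
        """The target's methods as the incorporating class sees them."""
        visible = {}
        for name, impl in self.target.definitions(path).items():
            if self.including is not None and name not in self.including:
                continue
            if name in self.excluding:
                continue
            visible[self.renaming.get(name, name)] = impl
        return visible


def greet(self):
    return "hello"


def shout(self):
    return "HELLO"


D = CKClass("D", public={"greet"}, defined={"greet": greet, "shout": shout})
C = CKClass("C", incorporates=[Incorporation(D, renaming={"shout": "yell"})])
D.incorporates.append(Incorporation(C))  # mutual incorporation is allowed

print(sorted(C.definitions()))  # ['greet', 'yell']
print(C.interface())            # set(): nothing public in D became public in C
```

The point of keeping interface() and definitions() as two unrelated traversals is exactly the one made above: incorporation changes what a class can do, and only subtyping and public declarations change what it promises.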
I'll have another post, hopefully soon, about how CK might be implemented on the JVM or CLR.
2007-02-03
TagSoup 1.0.3 released!
For TagSoup users among my readers, there's a new release at http://tagsoup.info, providing control of the output encoding and fixing a few bugs.
If you aren't a user but think you might like to be, TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Update: TagSoup 1.0.2 had some brown-paper-bag bugs, so I've released 1.0.3 as a replacement.
2007-01-11
The old farts have it
Age and treachery, it is said, beat youth and X every time. But what is X?
Googling suggests that this part of the proverb isn't very stable. I find: enthusiasm, skill, idealism, enthusiasm, innocence, inexperience and experience, endurance, exuberance, ability, virility, skills, reflexes, speed, talent, baggy pants, and a bad haircut.
2006-11-28
Being a HAXEor
(thanks to d8uv for the title)
I usually describe myself as an "'ex' troglodyte", because I prefer the Unix line editor ex(1) to all other text editors. This makes people look at me like I'm something they found by turning over a rock, but what do I care?
(I know that ed(1) is the standard text editor, but I'm willing to trade off a little minimalism for a little convenience.)
Anyhow, in the tradition of Tim Bray's MARS, I will now say that I make my web site with HAXE, standing for HTML, Apache, and ex (reversed for euphony and cuteness).
Update: Though I don't like cokebottle editors, much of what's said in The Case For Emacs is relevant to me. I don't use ex(1), it is part of me.
2006-11-13
"Irrumabo vos et pedicabo"
In my first year of college, long ago,
I took a class on Ovid and Catullus.
One of the sexual poems I found confusing,
and the book we were using
was quite devoid of commentary on it,
grammatical or otherwise.
So at the next class, I asked my professor
what the poet meant by such-and-such.
He was hesitating, doubtful, maybe-yes-maybe-no.
Yet at the following meeting of the class,
he was entirely changed:
he explained forthrightly just how the poem worked.
I could not understand the sudden change
until I looked about the studentry
and saw the only female student
absent that day.
I was shocked and outraged --
naïve nerd from a feminist family that I was --
to think that a professor! of the liberal arts!
and of Latin of all things! could be so sexist,
so crude, so utterly indifferent to his duties
to all his students.
Many years later, it occurred to me to wonder
if he had sunk so low as to ask her
to be absent that day so that he could answer my questions.
All the worse, I thought.
All the worse.
Looking back today, I think:
perhaps he was, poor man, in a cleft stick,
caught between the fear of being accused of harassment
by the woman for openly discussing sex in class,
and the fear of having his dean (who happened to be my mother)
coming down on him for neglecting the questions
of her precious darling (little did he know
that while she might have disapproved,
she would never have punished him for that --
my mother believed in justice).
It's a hell of a thing
when students can't learn
for fear or for shame
what the poets sing.
2006-11-12
An annoying ambiguity about which nothing can be done now
The phrase "COMBINING DOUBLE" in a Unicode character name can mean either of two things. Sometimes the diacritical mark is doubled with respect to some other mark:
- U+030B COMBINING DOUBLE ACUTE ACCENT ̋
- U+030E COMBINING DOUBLE VERTICAL LINE ABOVE ̎
- U+030F COMBINING DOUBLE GRAVE ACCENT ̏
- U+0333 COMBINING DOUBLE LOW LINE ̳
- U+033F COMBINING DOUBLE OVERLINE ̿
- U+0348 COMBINING DOUBLE VERTICAL LINE BELOW ͈
- U+035A COMBINING DOUBLE RING BELOW ͚
- U+20E6 COMBINING DOUBLE VERTICAL STROKE OVERLAY ⃦
But sometimes it means that the mark extends over two characters, the one it applies to and the following one:
- U+035D COMBINING DOUBLE BREVE ͝
- U+035C COMBINING DOUBLE BREVE BELOW ͜
- U+035E COMBINING DOUBLE MACRON ͞
- U+035F COMBINING DOUBLE MACRON BELOW ͟
- U+0360 COMBINING DOUBLE TILDE ͠
- U+0361 COMBINING DOUBLE INVERTED BREVE ͡
- U+0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW ͢
Of course U+1D18A MUSICAL SYMBOL COMBINING DOUBLE TONGUE 𝆊 is something else again.
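Since the name won't tell you which sense is meant, one way to sort the two groups out is by canonical combining class instead. The sketch below uses Python's standard unicodedata module and assumes that classes 233 (double below) and 234 (double above) are the ones assigned to marks spanning two characters; the function name combining_double_marks is just something I made up for the example.

```python
import sys
import unicodedata


def combining_double_marks():
    """Map each code point whose name contains 'COMBINING DOUBLE' to a flag
    saying whether its combining class marks it as spanning two characters."""
    result = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if "COMBINING DOUBLE" in name:
            # Assumption: 233 = double below, 234 = double above.
            result[cp] = unicodedata.combining(ch) in (233, 234)
    return result


marks = combining_double_marks()
for cp, spans_two in sorted(marks.items()):
    kind = "spans two characters" if spans_two else "doubled mark"
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {kind}")
```

The exact set of characters printed depends on the Unicode version your Python was built with, so newer versions will list more marks than appear above.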
Thank you. I feel much better now.
The lackmus test
In a post to one of the innumerable technical mailing lists I belong to, a native speaker of German used the phrase lackmus test, meaning a simple method for detecting differences. In English, the phrase is litmus test; why the difference?
Middle English had both the native English word lykemose and the Scandinavian borrowing litemose; only the latter has survived. The second morpheme in each case is that of English moss, but the first morphemes are different, meaning 'drip' and 'dye, color' respectively.
Litmus is made by drying and powdering certain lichens; it was originally used as a water-soluble dye, but is now generally used as a quick-and-dirty test for acidity, hence the metaphorical use of the term (it turns red in acids, blue in bases).
East is west and west is east
Little Diomede Island (U.S.) in the Bering Strait (not the Aleutians, as I mistakenly wrote earlier) is reckoned to be some tens of thousands of kilometers west of Big Diomede Island (Russia), despite the obvious fact that Little Diomede is about four kilometers east of Big Diomede.
The reason for that is that in the state of nature, Europe is east of North America, which is east of Asia, which is east of Europe. So it makes no sense to ask "Is X east or west of Y?" unless we have instituted a convention of some sort.
One possible convention is: "X is east of Y if and only if the easterly great-circle course between them is shorter than the westerly one." That's the rule we apply in ordinary life, and by that rule, the Russian island is west of the U.S. one.
But the navigator's convention unwraps the globe at the 180 degree meridian, and says that the entire Eastern Hemisphere is east of the entire Western Hemisphere. Using this convention, the Russian island is east of the U.S. one.
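The two conventions are easy to state as code. This is only a sketch: it assumes longitudes in degrees with east positive and west negative, compares longitudes alone rather than computing real great-circle courses, and uses two made-up points one degree either side of the 180 degree meridian rather than any actual island.

```python
def east_by_shorter_course(lon_x, lon_y):
    """Everyday convention: X is east of Y if going east from Y is the shorter way."""
    # Wrap the difference into [-180, 180); positive means X lies to the east.
    diff = (lon_x - lon_y + 180.0) % 360.0 - 180.0
    return diff > 0


def east_by_unwrapped_globe(lon_x, lon_y):
    """Navigator's convention: cut the globe at 180 degrees and compare directly."""
    return lon_x > lon_y


# Hypothetical points: X at 179 W, Y at 179 E, two degrees apart across the cut.
x, y = -179.0, 179.0
print(east_by_shorter_course(x, y))   # True
print(east_by_unwrapped_globe(x, y))  # False
```

Away from the 180 degree meridian the two functions agree whenever the points are less than half a world apart, which is why nobody notices the difference in ordinary life.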
And by the same token, Alaska, since it sticks into the Eastern Hemisphere, is the easternmost U.S. state as well as the westernmost and the northernmost. The southernmost state is Hawaii. Of the 48 contiguous states, the westernmost is Washington, the easternmost Maine, the southernmost Florida (thanks to Key West), and the northernmost Minnesota, due to a surveying error.
Say who?
I got four stupid financial spams with interesting From: lines the other day. These are the ones that pick dictionary words for first and last names: what is that supposed to be about, anyhow? Anyhow, here they are:
- The peculiar Queueing M. Secretively,
- The paradoxical Tough D. Frailty,
- The malapropos Foolhardiness T. Phoneticians,
- And the Marxist (tendance Chico) Spumoni P. Brickbat.
I guess it was the last one that put me over the top.
2006-10-20
Recording your phone calls.
Can you record your own phone calls, ingoing and outgoing? Usually, at least in the United States.
Most states are “one-party-consent law” states. If you live in one of these, you can always record your own in-state calls either openly or surreptitiously, since only one participant’s consent is needed. Likewise, you can get someone else to record them for you.
For interstate calls, it’s important to check this state-by-state summary, because both states’ laws apply and you need to follow the most stringent one. For example, if you live in California or are even just speaking to someone in California (an “all-party-consent law” state), you must get the other party’s permission to record the call, or risk having to pony up $5000 in statutory damages (or three times the actual damages, whichever is greater). In general, announcing your intent to record and letting the other party hang up if they don’t like it is sufficient in all states: continued participation implies consent.
The all-party-consent law states are: California, Connecticut, Florida, Illinois, Maryland, Massachusetts, Michigan, Montana, Nevada, New Hampshire, Pennsylvania, Washington. In Delaware, Indiana, Iowa, Mississippi, and probably New Mexico as well, a participant may record but a non-participant may not, even with consent. In Vermont the law is unsettled.
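The most-stringent-law rule boils down to a set lookup. The sketch below hard-codes nothing but the list given in this post, treats every state not on it as one-party-consent, and reports Vermont as unsettled; the function names consent_needed and california_statutory_damages are invented for the illustration.

```python
# All-party-consent states, exactly as listed in this post.
ALL_PARTY_STATES = {
    "California", "Connecticut", "Florida", "Illinois", "Maryland",
    "Massachusetts", "Michigan", "Montana", "Nevada", "New Hampshire",
    "Pennsylvania", "Washington",
}


def consent_needed(state_a, state_b):
    """Apply the stricter of the two states' rules to a call between them."""
    states = {state_a, state_b}
    if states & ALL_PARTY_STATES:
        return "all"        # either end being all-party is enough
    if "Vermont" in states:
        return "unsettled"  # the post calls Vermont's law unsettled
    return "one"            # assumption: everything else is one-party


def california_statutory_damages(actual_damages):
    """$5000 or three times the actual damages, whichever is greater."""
    return max(5000, 3 * actual_damages)


print(consent_needed("New York", "California"))  # all
print(california_statutory_damages(2000))        # 6000
```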
I am not a lawyer; this is not legal advice; laws change; errors happen.
2006-08-23
Slashdot, eWeek, Microsoft, the OSI, Groklaw, and me
Well, it seems I've made Slashdot, quite unintentionally. The article there references an eWeek article about how I proposed that the Open Source Initiative approve two Microsoft licenses, the Microsoft Permissive License and the Microsoft Community License. Here's a FAQ:
- Why is this story news in August 2006? Ya got me. Groklaw reported on it back in December 2005, when it was in fact news.
- Do you speak for Slashdot, eWeek, Microsoft, the OSI, Groklaw, or any of your past or present employers? No, only for myself.
- Is what the eWeek story says about you true? Yes, except that I no longer volunteer for ccil.org; I did some work for them in the past.
- Why did you propose the licenses for OSI approval? Because I believe they meet the elements of the Open Source Definition.
- Are the licenses basically similar to other OSI-approved licenses? Yes.
- Then why ask OSI to approve them? Because I want to encourage Microsoft to release software under an OSI-approved license, even if they feel it necessary to use their own license.
- Microsoft release anything under an Open Source license? Surely you jest. No, actually. Microsoft released WiX under the Common Public License, an OSI-approved license. And there have been other such releases.
- Why did you withdraw the request for OSI approval? For a number of reasons, it's awkward for OSI to approve licenses that are not proposed by the author of the license. The OSI wants to keep all approved licenses on its site, and may not have copyright permission to do so. Furthermore, if the OSI wants to request changes, only the author can make them.
- Does that mean you have changed your mind about the licenses? No, only about the suitability of OSI approving them.
- Are you a shill/astroturfer for Microsoft? No.
- What's your view on open-source software? I use a lot of it and have released my own code and other stuff under several different open-source licenses.
- What do you want to do with your fifteen minutes of fame? Wait for it to pass.
- Can I leave a comment? Yes. However, as Le Guin says, I can take a little inaccuracy or a little accusation, but the combination is poison. I reserve the right to remove comments I think are poisonous.
2006-08-18
The Elements of Style Revised
I've been working off and on for the last few weeks on updating the original 1918 edition of William Strunk's short book on the basics of elementary composition. No, it isn't "Strunk and White"; White's additions are still in copyright and thus untouchable. Nor is it the book I would have written myself from scratch; that would look a lot more like Mapping the Model, except Rosemary Hake has already written it, so why should I? (Alas, that book is out of print....)
Here's part of the Reviser's Introduction, so you can see if it's for you:
My revisions to the original are founded on the principle that rules of usage and style cannot be drawn out of thin air, nor constructed a priori according to "logic"; they must depend on the actual practice of those who are generally acknowledged to be good writers. For a larger work founded on the same principles and giving much more detailed and up-to-date advice on usage, the reader is urged to consult the current edition of Merriam-Webster's Concise Dictionary of English Usage, as I have done with both pleasure and profit while preparing this revision.
I have attempted to remain within the scope of the original. This book, therefore, is intended as a compendium of helpful advice to novice writers in freshman composition classes, not a code of general laws of writing for all works by all writers in all circumstances. Violations of the rules can be found within the book itself — this is neither inconsistent nor hypocritical, as The Elements of Style Revised is not a paper written for a composition class.
In updating Strunk's work from the 19th century to the late early 21st century, I have retained as much of Strunk's spirit and characteristic style as I could. I have removed the obsolete, the erroneous, and the merely idiosyncratic (Strunk's arbitrary dislike of "student body", for example) both from Strunk's own usage and from the rules laid down in his book. Like White, I have also added a few points to Chapters IV and V that seemed to me important enough to justify their presence, as well as removing Strunk's Chapter VI on spelling. I have not hesitated to replace Strunk's opinions with contrary ones, though I was pleasantly surprised to find that many of those I expected to require changing (strictures against split infinitives and final prepositions, as well as the preposterous which/that rule) did not appear in the 1918 edition at all.
Share and enjoy, and of course send me critiques.
2006-07-18
Well, wuddaya know
Here's a poem in blank verse:
Oh, moralists, who treat of happiness
And self-respect, innate in every sphere
Of life, and shedding light on every grain
Of dust in God's highway, so smooth below
Your carriage-wheels, so rough beneath the tread
Of naked feet, bethink yourselves
In looking on the swift descent
Of men who have lived in their own esteem,
That there are scores of thousands breathing now,
And breathing thick with painful toil, who in
That high respect have never lived at all
Nor had a chance of life! Go ye, who rest
So placidly upon the sacred Bard
Who had been young, and when he strung his harp
Was old...
Go, Teachers of content and honest pride,
Into the mine, the mill, the forge,
The squalid depths of deepest ignorance,
And uttermost abyss of man's neglect,
And say can any hopeful plant spring up
In air so foul that it extinguishes
The soul's bright torch as fast as it is kindled!
Who wrote it? Well, Charles Dickens, in Martin Chuzzlewit. What's that? You didn't know Dickens was a poet? Well, the above passage appears, printed as prose, in Chapter 13, with the additional words "and had never seen the righteous forsaken, or his seed begging their bread" between the two stanzas, quoted from Ps. 37:25. Such things can prose writers fall into when they are trying to be high-flown and not watching themselves carefully.
Kudos to H. W. Fowler for spotting this example.
2006-06-19
Futbol en masse
At the summer camp I attended as a yoot in the 60s, there was a game known as Mass Soccer. The chief feature of this game of games was that any number could play on either side. Kids being what they are, this meant that the entire center of the field was occupied by a permanent clot of forwards of all shapes and sizes, with a relative handful of backs whose chief task was to assist the goalies when a ball occasionally escaped from this huge scrum.
To keep things interesting, any number of balls could be in play at once. To keep things fair, each side was entitled to as many goalies as there were balls — when a ball went out of play, one of the goalies would do the usual thing to bring it back in while the game raged on around him. To keep things safe, the footwear rules were strictly enforced: no shoes allowed, socks required.
It was a hell of a lot of fun.
2006-06-15
TagSoup 1.0 Final released!
TagSoup is free and Open Source software, licensed under the Academic Free License version 3.0, a cleaned-up and patent-safe BSD-style license which allows proprietary re-use. It's also licensed under the GNU GPL version 2.0, since unfortunately the GPL and the AFL are incompatible. You can choose to license TagSoup from me under either the GPL or the AFL.
This release represents the end of my current plans for TagSoup. I will continue to fix bugs, but it now does everything that I foresaw back in 2002 when I started this project, and a great deal more. Thanks to everyone on the tagsoup-friends mailing list for their efforts.