2007-02-24

Regularized Inglish: wordlists

Here are some Regularized Inglish wordlists from Wijk.
The following words (from a list of the 1000 most frequent words) have the stressed vowel changed:
abuv, afternoone, agen, agenst, aul, aulmoste, aulreddy, aulso, aultho, aulways, amung, anuther, anser, eny, enything, ask, baul, bair (verb), bayr (noun), beutiful, beuty, becoz, becum, beloe, blud, bloe, branch, breik, bruther, braught, bild, bilding, bilt, bisness, bisy, bye (for "buy"), caul, can't, chance, chainge, Shicago, class, cullour/cullor, cum, cumming, command, cumpany, controel, cood, coodn't, cuntry, corse, cort, cuver, dance, dainger, ded, deth, demand, discuver, doo, duz, duzn't, dun, doen't, dor, dubble, erly, erth, yther/eether, Ingland, Inglish, enuff, example, ie (for "eye"), faul, flor, floe (for "flow"), foar (for "four"), frend, frunt, fooll (for "full"), guvernment, grant, grass, greit, groope, groe, haf, haul, hed, helth, herd (for "heard"), hart (for "heart"), heven, hevy, hight, insted, iorn, jurnal, knoe, knoen, knolledge, laf, lern, Lundon, looze, luv, loe, loer, mashien, meny, mesure, munney, munth, muther, moove(ment), nyther/neether, nun (for "none"), nuthing, wunce, wun, oenly, uther, aught (for "ought"), oen, peeple, plesant, plesure, pritty, proove, pooll (for "pull"), poot (for "put"), quorter, red (for "read"), reddy, receev, remoove, roel, sault, sed, ses, shoo, shood (for "should"), shoelder, shoe (for "show"), smaul, snoe, sum(thing,times), sun (for "son"), soel, spred, strainge, tauk, taul, tair, thare, tharefore, tho, thaught, thru, tuch, tord (for "toward"), trubble, too (for "two"), waul, wont (for "want"), wor (for "war"), worm (for "warm"), wos, wosn't, wosh, Woshington, wotch, wauter, wair, wether, whot(ever), whare, hoo, hoel, hoome (for "whom"), hooze, wooman, wimmen, wunder(ful), woen't, wurd, wurk, wurld, wood (for "would"), woodn't, yoo, yung, yoor(self), yoo'r, yoothe.
And here are the words that have an unstressed vowel changed. Most unstressed vowel spellings are left alone in Regularized Inglish, but these are changed to avoid confusion.
capten, certen, certenly, mounten, forren, felloe, folloe, folloeing, narroe, tomorroe, windoe, yelloe.
From the same list, changes involving adding or removing a final e:
afternoone, childe, wilde, behinde, finde, kinde, minde, winde (verb), sine (for "sign"), moste, poaste, bothe, truthe, coole, foode, foole, moone, roofe, roome, scoole, soone, gon, Europ, ar, wer, figur, promis, purpos, minut, hav, giv, liv, nativ, believ, leav, serv, twelv, themselvs.
And finally, words with changes in consonants:
caracter, dout, gard, onnour/onnor, our (for "hour"), iland, lissen, ov, offen, scoole, shugar, shure, studdy, sugest, sunn, winn.
Here are the analogous lists for the next most frequent 1000 words, in the same order (stressed vowels, unstressed vowels, final e, consonants):
accumpany, ahed, aincient, eny(body,wun,way), arrainge, baught, boe, boel, bred, brekfast, brest, breth, braud, bery, boosh, cam (for "calm"), cassle, chaimber, clark/clerk, curnel, cumfort(able), cupple, currage, cuzin, daingerous, discuvery, disese, duzen, ern, everywhare, exchainge, faulen, faulse, flud, foek, faught, foarth, frend(ly,ship), foolly (for "fully"), glance, gloe, guvernor, greitly, groen, groth, hunney, improove, jurney, kee, lether, luvly, luver, mashienery, magazien, ment, utherwise, oe, oener, poliece, poar, prayr, poosh, quolity, quontity, quorrel, rainge, recaul, recuver, ruff, rute, roe, scaircely, serch, seeze, shoen, sloe (for "slow"), sloely, sum(wun,whot,whare), saught, sorce, suthern, steddy, strainger, thare'z, thred, thretten, thruout, throe (for "throw"), throen, tung, tresure, unknoen, wonder (for "wander"), worn (for "warn"), welth, woolf, wunn (for "won"), wurker, wurry, wurse, wurship, wurst, wurthy, woonde, yoo'd, yoo'll, yoo'v.
automobiel, curten, equol, welcum, holloe, shaddoe, sorroe.
blinde, kindely, desine, worne, noone, shoote, smoothe, troope, handsom, determin, engin, examin, imagin, purchase, favourit/favorit, immediat(ly), opposit, privat, senat, separat (adj.), havn't, activ, representativ, observ, ourselvs, preserv, reserv.
ashure, clime, Crist(ian,mas), det, onnest, leag, Lincon, sacrifise, summ, shurely, sord, Tomas, whissle.
These lists probably contain minor errors of transcription.
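
To show how lists like these could be put to work, here is a minimal Python sketch of a lookup-based respeller. It is only an illustration: the table holds a handful of entries from the lists above (paired with their obvious traditional spellings), and the names RESPELLINGS and respell are made up for this example. A plain lookup like this cannot handle pairs that depend on part of speech, such as bair (verb) and bayr (noun).

```python
import re

# A few entries taken from the lists above, keyed by traditional spelling.
# The table and function names are hypothetical, chosen for this sketch.
RESPELLINGS = {
    "above": "abuv",
    "friend": "frend",
    "government": "guvernment",
    "have": "hav",
    "island": "iland",
    "window": "windoe",
    "world": "wurld",
}


def respell(text):
    """Replace each listed word in text with its Regularized Inglish spelling."""

    def swap(match):
        word = match.group(0)
        new = RESPELLINGS.get(word.lower())
        if new is None:
            return word  # not in the table: leave the word alone
        # Preserve a leading capital letter.
        return new.capitalize() if word[0].isupper() else new

    # Match runs of letters and apostrophes; punctuation and spaces pass through.
    return re.sub(r"[A-Za-z']+", swap, text)
```

Calling respell on a sentence leaves every unlisted word and all punctuation untouched, which matches the spirit of a reform that changes only the truly irregular words.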

Regularized Inglish: theory

I see I promised to post on Regularized Inglish (RI) back in 2005, but never got around to it. Here's a brief explanation.

Axel Wijk's Regularized Inglish is a massive multi-decade job (completed in the 1950s, so there's nothing available online about it) of analyzing practically every word in the language, figuring out what the complicated rules behind the spelling system really are, identifying all the truly irregular words, and proposing properly rule-governed spellings for them. English, for example, is truly irregular only in its first vowel, and so it becomes Inglish.

The underlying principle of RI is that every spelling shall correspond to at most a few sounds, preferably only one; multiple spellings for a single sound, however, are tolerated. Thus -ough is kept for bough, but not for rough, through, plough, hiccough, hough, or borough. Why choose bough? In order to consistently apply the RI rule that says "gh has no effect on the pronunciation of any word".

In my opinion, Wijk goes a bit far in a few places: for example, he introduces dh for the sound of th in father, except in initial position (the, not dhe), for very little gain; he sorts out long "a" into "a" as in fat and "aa" as in father; he changes s to z when pronounced that way, except in the plural of nouns (not nounz) and the third-person singular ending of verbs. I wouldn't bother with any of these changes, which have little impact on being able to pronounce words at sight.

But as reformed (not revolutionized) spellings go, RI is a Great Thing.

2007-02-19

A haiku

Quantum mechanics:
    Which moves through the springtime pond?
        Frog or a ripple?

This is post 200.

TagSoup 1.0.4 released!

Just a bug-fix release. See tagsoup.info.

2007-02-18

The heart of Celebes Kalossi

I'm finally ready to explain the central ideas of my object-oriented programming model Celebes Kalossi (read the linked post first for the terminology). I've been hanging fire on it for a long time over a terminological issue, which I've decided to just punt on.

In CK, there are two kinds of relationships between classes: subtyping and incorporation. Subtyping is like the relationship between a Java interface and its superinterface; incorporation is something like C++ private inheritance. These two concepts are intertwingled in various ways in various programming languages, but CK completely separates them. Every class subtypes one or more classes except for the root of the class hierarchy (if there is one); classes can incorporate zero or more other classes.

When you declare that class A subtypes class B, you mean that A (the subclass) has all the methods that are declared or defined public in B (the superclass) and all of B's superclasses, plus those declared or defined public in A itself. You are not saying that any of the definitions provided in B or its superclasses are or are not available in A, so there is no problem if a given method is public in more than one superclass. You are also suggesting that an instance of A is Liskov substitutable for an instance of B, although it is impossible to check this property mechanically. We call the public methods of A and its superclasses the interface of A.
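
As a rough illustration of this rule, the following Python sketch computes the interface of a class as the union of the public methods of the class and all its superclasses. The dictionaries, the class names and the function name are assumptions made for the example, not part of CK, and the sketch assumes the subtype hierarchy has no loops.

```python
def interface(cls, supertypes, public):
    """Return the set of public method names in the interface of cls.

    supertypes maps a class name to its direct superclasses;
    public maps a class name to the methods declared or defined public there.
    Assumes an acyclic hierarchy (this sketch does not handle loops).
    """
    methods = set(public.get(cls, ()))
    for parent in supertypes.get(cls, ()):
        methods |= interface(parent, supertypes, public)
    return methods


# Hypothetical hierarchy: "close" is public in two superclasses of File.
supertypes = {
    "File": ["Readable", "Closeable"],
    "Readable": ["Root"],
    "Closeable": ["Root"],
}
public = {
    "Root": {"equals"},
    "Readable": {"read", "close"},
    "Closeable": {"close"},
    "File": {"path"},
}

file_interface = interface("File", supertypes, public)
```

Because only method names are collected, the fact that close is public in both Readable and Closeable causes no conflict: subtyping says nothing about which definition, if any, File uses.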

When you declare that class C incorporates class D, on the other hand, you mean that the non-private methods defined in class D (the incorporated class) are effectively given the same definition in class C (the incorporating class). Provided that the methods in D have been properly declared in class C, they can be invoked on objects of class C just as if they had been defined as standard or public methods of class C. It does not matter if a method is defined in class D and declared in class C (which is C++ private inheritance) or vice versa. However, the fact that a method is public in class D does not make it public in class C unless it is declared public in class C or one of its superclasses: incorporation affects behavior but not interface.
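
Here is a minimal Python sketch of that distinction, under simplifying assumptions: it looks only at directly incorporated classes, it ignores declarations inherited from superclasses, and the CKClass record and resolve function are invented for the illustration rather than taken from any CK implementation.

```python
from dataclasses import dataclass, field


@dataclass
class CKClass:
    """Hypothetical record describing one CK class."""
    name: str
    public: set = field(default_factory=set)      # methods declared public here
    declared: set = field(default_factory=set)    # methods declared here, any visibility
    defined: dict = field(default_factory=dict)   # method name -> implementation
    private: set = field(default_factory=set)     # defined here but never shared
    incorporates: list = field(default_factory=list)


def resolve(cls, method):
    """Return the implementation invocable on objects of cls, or None."""
    if method in cls.defined:
        return cls.defined[method]
    if method not in cls.declared and method not in cls.public:
        return None  # never declared in cls, so not invocable through it
    for inc in cls.incorporates:
        # Only non-private definitions are given to the incorporating class.
        if method in inc.defined and method not in inc.private:
            return inc.defined[method]
    return None


d = CKClass(
    "D",
    public={"greet"},
    defined={"greet": lambda: "hello", "secret": lambda: "hidden"},
    private={"secret"},
)
# C declares greet (but not as public) and incorporates D.
c = CKClass("C", declared={"greet", "secret"}, incorporates=[d])
```

In this sketch greet can be invoked on C because C declares it and D defines it, yet greet is absent from C's public set, so it is not part of C's interface. The private method secret stays unavailable even though C declares it.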

For convenience, we say that a class subtypes itself and incorporates itself. Both incorporation and subtyping are transitive: class A subtypes class B's superclasses as well as class B itself, and classes incorporated by class D are implicitly incorporated by class C as well. All the methods in the incorporated classes of C are placed on an equal footing: it does not matter how they were incorporated. A loop in the subtype hierarchy specifies that all the types in the loop have the same interface; a loop in the incorporation hierarchy is ignored, so even if C incorporates D, D can also incorporate C. Note that an implementation may provide types which are outside the model: typical examples would be numeric types, strings, and exception objects.
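
One way to picture the reflexive and transitive rules, and why loops are harmless, is a closure computation over either relation. The Python sketch below is only an assumption about how this could be modeled; the function name and the example classes are hypothetical.

```python
def closure(start, edges):
    """Return every class reachable from start through edges, including start.

    edges maps a class name to the classes it directly subtypes
    (or directly incorporates, when used for the other relation).
    """
    reached = set()
    pending = [start]
    while pending:
        cls = pending.pop()
        if cls in reached:
            continue  # already visited, which is what makes loops harmless
        reached.add(cls)
        pending.extend(edges.get(cls, ()))
    return reached


# A loop in the subtype hierarchy: A subtypes B and B subtypes A.
subtype_edges = {"A": ["B"], "B": ["A"]}

# A loop in the incorporation hierarchy: C incorporates D, and D
# incorporates both E and C.
incorporation_edges = {"C": ["D"], "D": ["E", "C"]}

a_supers = closure("A", subtype_edges)
b_supers = closure("B", subtype_edges)
c_incorporated = closure("C", incorporation_edges)
```

Since A and B reach exactly the same set of classes, any interface computed from that set is the same for both, which is the sense in which types in a subtype loop share an interface. The incorporation loop simply adds nothing new once C has been visited.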

If a method is declared in a class X or in any of the classes incorporated (directly or indirectly) by class X, it must also be defined exactly once in one of those classes. (If there are no definitions of some method, the class is abstract and must be declared as such.) If incorporating a class would cause conflicts, CK provides for renaming or hiding unwanted methods: class L can incorporate class M including specified methods (in which case all others are hidden), excluding specified methods, or giving a method a new name. Classes that incorporate L see the changed view.
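
The exactly-once rule and the three adjustments (including, excluding, renaming) can be sketched in Python as follows. This is an illustration under stated assumptions, not CK syntax: methods are represented by name only, and the functions view and check_definitions and their parameters are invented for the example.

```python
from collections import Counter


def view(defined, include=None, exclude=(), rename=None):
    """Return the method names of an incorporated class as the incorporator sees them.

    defined: names the incorporated class defines
    include: if given, only these names stay visible (all others are hidden)
    exclude: names to hide
    rename:  mapping from old name to new name
    """
    rename = rename or {}
    names = set(defined) if include is None else set(defined) & set(include)
    names -= set(exclude)
    return {rename.get(name, name) for name in names}


def check_definitions(declared, own_defined, views):
    """Return (missing, conflicts) for the declared method names.

    missing:   declared but defined nowhere, so the class must be abstract
    conflicts: defined more than once, so something must be hidden or renamed
    """
    counts = Counter(own_defined)
    for visible in views:
        counts.update(visible)
    missing = {m for m in declared if counts[m] == 0}
    conflicts = {m for m in declared if counts[m] > 1}
    return missing, conflicts


# Hypothetical case: M defines draw and size, N defines draw as well.
m_defined = {"draw", "size"}
n_defined = {"draw"}
declared = {"draw", "size", "area"}

before = check_definitions(declared, set(), [view(m_defined), view(n_defined)])
after = check_definitions(
    declared, set(), [view(m_defined), view(n_defined, exclude={"draw"})]
)
```

Before the adjustment, draw has two definitions and area has none. Hiding N's draw resolves the conflict, while area remains undefined, so the incorporating class would still have to be declared abstract.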

There are no inheritance rules in CK, because there is no inheritance as such: if you want X to use the same implementation of method m as its superclass Y, then incorporate the class Z from which Y gets its implementation of method m. You can do the same thing for classes that are not related to you in the superclass hierarchy, or even for your subclasses if you really want to -- the model is nothing if not flexible.

I'll have another post, hopefully soon, about how CK might be implemented on the JVM or CLR.

2007-02-03

TagSoup 1.0.3 released!

For TagSoup users among my readers, there's a new release at http://tagsoup.info, providing control of the output encoding and fixing a few bugs.

If you aren't a user but think you might like to be, TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Update: TagSoup 1.0.2 had some brown-paper-bag bugs, so I've released 1.0.3 as a replacement.