2008-01-31

Taggle, a TagSoup in C++, available now

A company called JezUK has released Taggle, a straight port of TagSoup 1.2 to C++. It's part of Arabica, a C++ XML toolkit providing SAX, DOM, XPath, and partial XSLT. I have no connection with JezUK (except, apparently, as a source of inspiration).

The author says the code is alpha-quality now, so he'd appreciate lots of testers to shake out bugs. C++ users, go to it! Having a C++ port will be a real enhancement for TagSoup.

The code is currently in public Subversion; you can fetch it with:

    svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port

2008-01-10

Revised home page

I've rewritten my home page at http://www.ccil.org/~cowan. Some interesting old things that were on my site but had no pointers from there now have little writeups, and I've reorganized it a bit -- but it's still the ultimate minimalist home page, no pictures or graphics.

2008-01-05

TagSoup 1.2 released at long last

There are a great many changes, most of them fixes for long-standing bugs, in this release. Only the most important are listed here; for the rest, see the CHANGES file in the source distribution. Very special thanks to Jojo Dijamco, whose intensive efforts at debugging made this release a usable upgrade rather than a useless mass of undetected bugs.

  • As noted above, I have changed the license to Apache 2.0.

  • The default content model for bogons (unknown elements) is now ANY rather than EMPTY. This is a breaking change, which I have made only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with parser.setFeature(Parser.emptyBogonsFeature, true).

  • The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like foo?cdown=32&cup=42 are no longer seen as containing an instance of the ∪ character (whose name happens to be cup).

  • Several new switches have been added:

    • --doctype-system and --doctype-public force a DOCTYPE declaration to be output and allow setting the system and public identifiers.

    • --standalone and --version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use --version=1.1.)

    • --norootbogons causes unknown elements not to be allowed as the document root element. Instead, they are made children of the default root element (the html element for HTML).

  • The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the 2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.

  • The SAX events startPrefixMapping and endPrefixMapping are now being reported for all cases of foreign elements and attributes.

  • All bugs around newline processing on Windows should now be gone.

  • A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.

  • Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element. This very long-standing bug has now been fixed.

  • The following non-standard elements are now at least partly supported: bgsound, blink, canvas, comment, listing, marquee, nobr, rbc, rb, rp, rtc, rt, ruby, wbr, xmp.

  • In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as checked="checked".

  • Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.
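
To make the attribute-value entity rule above concrete, here is a minimal sketch in plain Java. It is not TagSoup's actual code: the class AttrEntities, the method decode, and the four-entry entity table are all made up for illustration, and numeric references like &#38; are not handled. It shows only two things from the list: a reference counts only when a semicolon terminates it, and an entity whose value is above U+FFFF comes out as a surrogate pair.

```java
import java.util.HashMap;
import java.util.Map;

public class AttrEntities {
    // Hypothetical, tiny entity table; the real HTML schema has thousands of entries.
    static final Map<String, Integer> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", 0x26);
        ENTITIES.put("lt", 0x3C);
        ENTITIES.put("cup", 0x222A);   // the union sign
        ENTITIES.put("Afr", 0x1D504);  // above U+FFFF, needs two UTF-16 code units
    }

    // Expand named references in an attribute value, but only when the
    // name is terminated by a semicolon and is found in the table.
    // Anything else is copied through as plain text.
    static String decode(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                int semi = value.indexOf(';', i + 1);
                if (semi > i + 1) {
                    Integer cp = ENTITIES.get(value.substring(i + 1, semi));
                    if (cp != null) {
                        out.appendCodePoint(cp); // handles supplementary characters
                        i = semi + 1;
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // No semicolon after "cup", so the URI passes through unchanged.
        System.out.println(decode("foo?cdown=32&cup=42"));
        // Properly terminated, so &cup; is replaced by U+222A.
        System.out.println(decode("A &cup; B"));
        // One character, two UTF-16 code units: prints 2.
        System.out.println(decode("&Afr;").length());
    }
}
```

The sketch leaves an unknown or unterminated name exactly as written, which is the behaviour the foo?cdown=32&cup=42 example relies on.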

Download the TagSoup 1.2 jar file here. It's about 87K long.
Download the full TagSoup 1.2 source here. If you don't have zip, you can use jar to unpack it.
Download the current CHANGES file here.