2008-01-05

TagSoup 1.2 released at long last

There are a great many changes, most of them fixes for long-standing bugs, in this release. Only the most important are listed here; for the rest, see the CHANGES file in the source distribution. Very special thanks to Jojo Dijamco, whose intensive efforts at debugging made this release a usable upgrade rather than a useless mass of undetected bugs.

  • As noted above, I have changed the license to Apache 2.0.

  • The default content model for bogons (unknown elements) is now ANY rather than EMPTY. This is a breaking change, which I have made only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with parser.setFeature(Parser.emptyBogonsFeature, true).

  • The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like foo?cdown=32&cup=42 are no longer seen as containing an instance of the ∪ character (whose name happens to be cup).

  • Several new switches have been added:

    • --doctype-system and --doctype-public force a DOCTYPE declaration to be output and allow setting the system and public identifiers.

    • --standalone and --version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use --version=1.1.)

    • --norootbogons causes unknown elements not to be allowed as the document root element. Instead, they are made children of the default root element (the html element for HTML).

  • The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the 2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.

  • The SAX events startPrefixMapping and endPrefixMapping are now being reported for all cases of foreign elements and attributes.

  • All bugs around newline processing on Windows should now be gone.

  • A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.

  • Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element. This very long-standing bug has now been fixed.

  • The following non-standard elements are now at least partly supported: bgsound, blink, canvas, comment, listing, marquee, nobr, rbc, rb, rp, rtc, rt, ruby, wbr, xmp.

  • In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as checked="checked".

  • Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.
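
The attribute-value rule described above (a reference counts only when it ends in a semicolon) can be sketched in a few lines of plain Java. This is an illustration only, not TagSoup's actual code: the class name AttrEntitySketch, the method expand, and the four-entry entity table are all assumptions made for the example, and numeric character references are ignored. It also shows one way an entity with a value above U+FFFF can be emitted, using appendCodePoint.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and the tiny entity table are hypothetical,
// and this is not how TagSoup itself is implemented.
class AttrEntitySketch {
    // Stand-in for the real HTML entity table.
    static final Map<String, Integer> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", 0x26);
        ENTITIES.put("lt", 0x3C);
        ENTITIES.put("cup", 0x222A);    // the union sign
        ENTITIES.put("Aopf", 0x1D538);  // a value above U+FFFF
    }

    // Expands &name; references in an attribute value.
    // A name that is not followed by ';' is copied through as plain text.
    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name.
                int j = i + 1;
                while (j < value.length()
                        && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                boolean terminated =
                    j > i + 1 && j < value.length() && value.charAt(j) == ';';
                if (terminated) {
                    Integer cp = ENTITIES.get(value.substring(i + 1, j));
                    if (cp != null) {
                        // appendCodePoint writes a surrogate pair when needed.
                        out.appendCodePoint(cp);
                        i = j + 1;
                        continue;
                    }
                }
            }
            // Not a recognized, terminated reference: keep the character as is.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // No semicolon after "cup", so the URI comes back unchanged.
        System.out.println(expand("foo?cdown=32&cup=42"));
        // Properly terminated, so the union sign is substituted.
        System.out.println(expand("A &cup; B"));
    }
}
```

In this sketch an unknown name such as &bogus; is also left untouched; whether a real parser should do the same is a separate design decision.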

Download the TagSoup 1.2 jar file here. It's about 87K long.
Download the full TagSoup 1.2 source here. If you don't have zip, you can use jar to unpack it.
Download the current CHANGES file here.

6 comments:

Danny said...

Hi John, I've got a Tidy kind of requirement, wondering how suitable TagSoup would be (and if you have any tips on easy implementation).

So I'm starting with my blog data in a 35MB RDF/XML file, and I've got a chunker to split it into manageable blocks so they can be POSTed up to a remote triplestore.

Unfortunately, the HTML (in content:encoded elements) is very dodgy.

Ultimately I'd like to set up some kind of declarative pipeline system (XProc?):

[raw] - [remove foaf:mbox properties] - [chunker] - [clean HTML] - [post]

What do you reckon?

Danny said...

Oh, and congrats on the release!

John Cowan said...

It's not clear to me whether the embedded HTML is as-is (thus rendering the whole document not well-formed XML) or whether it's been encoded with &amp; and &lt; references.

In the first case, you could try running TagSoup over the whole thing and see what you get. TagSoup is not a proper XML parser, and namespace URI information in particular will be lost, but at least the result will be well-formed, which means you can use other tools to process it.

In the second case, you'd want to pull out each bit of embedded HTML, de-escape it (replace each "&lt;" with "<", and likewise for "&gt;", "&quot;" and "&apos;", then finally each "&amp;" with "&", so that nothing gets decoded twice) and process it separately, reintegrating the result.
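
A minimal sketch of that de-escaping step in plain Java, assuming only the five predefined XML references appear (no numeric references). The class name DeEscapeSketch and the method deEscape are made up for the example. The point it illustrates is the ordering: "&amp;" is handled last, so text that was escaped twice is decoded only one level.

```java
// Illustrative sketch; the names are hypothetical and numeric character
// references such as &#60; are deliberately not handled.
class DeEscapeSketch {
    // Reverses one level of XML escaping for the five predefined references.
    static String deEscape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                // "&amp;" must come last: doing it first would turn
                // "&amp;lt;" into "&lt;" and then wrongly into "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String escaped = "&lt;p class=&quot;x&quot;&gt;Tom &amp;amp; Jerry&lt;/p&gt;";
        // The doubly escaped ampersand stays as "&amp;" in the output.
        System.out.println(deEscape(escaped));
    }
}
```

The de-escaped fragment is what would then be handed to TagSoup, and the cleaned result re-escaped or embedded as real markup when it is put back.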

David Carlisle said...

Congratulations on the release. Interesting that you've picked up the entities draft; nice to see someone other than me using that :-)

The download links at the end of the post all give 404 errors (they work from the version at http://home.ccil.org/~cowan/XML/tagsoup/)

John Cowan said...

Thanks, David. Links updated.

Anonymous said...

I was surprised that you identified your employer, Santa Claus, in your about paragraph. I thought Santa's workers were sworn to secrecy. ;)

Reading about the others with your name, I was glad to find that you and your cousin Jack were not the same person. For you to be your own cousin would require your parents to be siblings.

Interesting blog, thanks.