There are a great many changes, most of them fixes for long-standing bugs, in this release. Only the most important are listed here; for the rest, see the CHANGES file in the source distribution. Very special thanks to Jojo Dijamco, whose intensive efforts at debugging made this release a usable upgrade rather than a useless mass of undetected bugs.
As noted above, I have changed the license to Apache 2.0.
The default content model for bogons (unknown elements) is now ANY rather than EMPTY. This is a breaking change, which I have made only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with
The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like foo?cdown=32&cup=42 are no longer seen as containing an instance of the ∪ character (whose name happens to be cup).
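The semicolon rule can be illustrated with a minimal sketch. This is not TagSoup's actual code: the class name AttributeEntityDemo, the decode method, and the three-entry entity table are all hypothetical, chosen only to show the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeEntityDemo {
    // Tiny illustrative entity table (hypothetical); a real HTML schema has far more entries.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("cup", "\u222A"); // the ∪ character
    }

    // Expands &name; only when the name is known AND terminated by a semicolon;
    // anything else is copied through as plain text.
    public static String decode(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name following the ampersand.
                int j = i + 1;
                while (j < value.length() && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                String name = value.substring(i + 1, j);
                if (j < value.length() && value.charAt(j) == ';' && ENTITIES.containsKey(name)) {
                    out.append(ENTITIES.get(name));
                    i = j + 1; // skip past the semicolon
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "&cup" is followed by "=", not ";", so the URI is left unchanged.
        System.out.println(decode("foo?cdown=32&cup=42"));
        // A properly terminated reference is expanded.
        System.out.println(decode("A &cup; B"));
    }
}
```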
Several new switches have been added:
DOCTYPE declaration to be output and allow setting the system and public identifiers.
--version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use
--norootbogons causes unknown elements not to be allowed as the document root element. Instead, they are made children of the default root element (the html element for HTML).
The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the 2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.
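For readers consuming the output from Java, here is a small sketch of what a value above U+FFFF looks like in practice. It uses only the standard library; the class name SupplementaryDemo and the helper fromCodePoint are hypothetical, and U+1D538 is just one example of such a character. Because Java strings and SAX character buffers are made of 16-bit chars, one such character occupies two of them (a surrogate pair).

```java
public class SupplementaryDemo {
    // Converts a Unicode code point to a Java String; code points above
    // U+FFFF become a surrogate pair of two chars.
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1D538); // MATHEMATICAL DOUBLE-STRUCK CAPITAL A
        System.out.println(s.length());                      // prints 2: two chars
        System.out.println(s.codePointCount(0, s.length())); // prints 1: one character
    }
}
```

Code that counts or slices text by char index may therefore need to work in code points instead.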
- The SAX events endPrefixMapping are now being reported for all cases of foreign elements and attributes.
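As a rough illustration of what these events look like to a consumer, the sketch below records the prefix-mapping callbacks of the standard SAX ContentHandler interface. It deliberately uses the JDK's built-in XML parser rather than TagSoup, so it shows only the shape of the events, not TagSoup's behaviour; the class name PrefixMappingDemo, the method prefixEvents, and the urn:example namespace are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PrefixMappingDemo {
    // Parses the given XML and records the prefix-mapping events in order.
    public static List<String> prefixEvents(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for namespace processing
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                events.add("start " + prefix + " -> " + uri);
            }

            @Override
            public void endPrefixMapping(String prefix) {
                events.add("end " + prefix);
            }
        });
        return events;
    }

    public static void main(String[] args) throws Exception {
        // The x prefix is declared on <a>, so its mapping spans that element.
        System.out.println(prefixEvents("<a xmlns:x='urn:example'><x:b/></a>"));
    }
}
```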
All bugs around newline processing on Windows should now be gone.
- A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.
Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element. This very long-standing bug has now been fixed.
The following non-standard elements are now at least partly supported:
In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as
Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.