
                        TagSoup - Just Keep On Truckin'

  Introduction

   This is the home page of TagSoup, a SAX-compliant parser written in
   Java that, instead of parsing well-formed or valid XML, parses HTML as
   it is found in the wild: [1]poor, nasty and brutish, though quite
   often far from short. TagSoup is designed for people who have to
   process this stuff using some semblance of a rational application
   design. By providing a SAX interface, it allows standard XML tools to
   be applied to even the worst HTML. TagSoup also includes a
   command-line processor that reads HTML files and can generate either
   clean HTML or well-formed XML that is a close approximation to XHTML.

   TagSoup is free and Open Source software, licensed under the
   [2]Academic Free License version 3.0, a cleaned-up and patent-safe
   BSD-style license which allows proprietary re-use. It's also licensed
   under the [3]GNU GPL version 2.0, since unfortunately the GPL and the
   AFL are incompatible. You can choose to license TagSoup from me under
   either the GPL or the AFL.

  Warning: TagSoup will not build on Java 5.x or 6.x!

   Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
   TagSoup will not build out of the box. You need to retrieve [4]Saxon
   6.5.5, which does not have the bug. Unpack the zipfile in an empty
   directory and copy the saxon.jar and saxon-xml-apis.jar files to
   $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
   Saxon is available and use it instead.

  TagSoup 1.0.5 released!

   This is yet another bug-fix release. The main issue was with HTML
   comments, which were very badly broken -- any > character would
   terminate one, so commenting out elements did not work properly. I
   think everything is now correct. Everyone who possibly can should
   update.

   Additionally, &#Xnnnn (with capital X) now works, some debugging code
   was removed from PYXWriter, a Unicode BOM at the beginning of a
   document is skipped, and the new version of Saxon is now supported as
   an XSLT processor.
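
   The capital-X fix can be pictured with a small sketch. This is not
   TagSoup's scanner code: the names CharRefs and decode are made up for
   illustration, and the method only looks at the text between &# and
   the optional semicolon.

```java
final class CharRefs {
    /**
     * Decodes the body of a numeric character reference: decimal ("233")
     * or hexadecimal introduced by either "x" or "X" ("xE9", "XE9").
     * Returns the code point, or -1 if the body is not a valid reference.
     */
    static int decode(String body) {
        if (!body.matches("[0-9]+|[xX][0-9A-Fa-f]+")) {
            return -1;
        }
        boolean hex = body.charAt(0) == 'x' || body.charAt(0) == 'X';
        try {
            int cp = hex ? Integer.parseInt(body.substring(1), 16)
                         : Integer.parseInt(body);
            return Character.isValidCodePoint(cp) ? cp : -1;
        } catch (NumberFormatException e) {
            return -1; // more digits than fit in an int
        }
    }

    public static void main(String[] args) {
        for (String body : new String[] {"233", "xE9", "XE9", "X110000"}) {
            System.out.println("&#" + body + "; -> " + decode(body));
        }
    }
}
```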

   I've also added the promised documentation on [5]SAX features and
   properties specific to TagSoup.

   [6]Download the TagSoup 1.0.5 jar file here. It's about 51K long.
   [7]Download the full TagSoup 1.0.5 source here. If you don't have zip,
   you can use jar to unpack it.

  TagSoup 1.0.4 released!

   This is a bug-fix release. The --method=html switch, and consequently
   --html as well, wasn't working properly, and while fixing that I found
   that several switches were mutually exclusive that should not have
   been.

  TagSoup 1.0.3 released!

   There is only one new feature in this release, the --output-encoding
   switch, which allows you to specify the character encoding for TagSoup
   output. The encoding must be one that your JVM supports; if you omit
   this switch, you will get the platform default, which is generally
   either Latin-1 or its superset Windows-1252. If the encoding you
   specify is a Unicode one, TagSoup will not generate character
   references; otherwise, it will continue to do so.
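
   A rough sketch of that rule, using only the JDK. The class
   OutputEscaper is hypothetical, and two simplifications are assumed:
   an encoding counts as Unicode if its name starts with "UTF", and for
   any other encoding every non-ASCII character becomes a decimal
   character reference. TagSoup's own writer may decide differently.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class OutputEscaper {
    /** True if this JVM can encode to the named charset. */
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // syntactically illegal charset name
        }
    }

    /** Assumption: "Unicode encoding" means the name starts with UTF. */
    static boolean isUnicode(String name) {
        return name.toUpperCase(Locale.ROOT).startsWith("UTF");
    }

    /** Replaces non-ASCII characters by character references,
     *  unless the output encoding is a Unicode one. */
    static String escape(String text, String encoding) {
        if (isUnicode(encoding)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(escape("caf\u00e9", "ISO-8859-1"));
        System.out.println(escape("caf\u00e9", "UTF-8"));
    }
}
```

   Iterating over code points rather than chars keeps characters outside
   the Basic Multilingual Plane together as a single reference.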

   In other news, the misleading version attribute of the html element
   has been removed, comments with defective hyphens are cleaned up
   properly, and broken public identifiers in the DOCTYPE declaration
   are cleaned.

   Version 1.0.2 was accidentally released with Java 6 classes in the
   jarfile, which cannot be run on earlier versions of the JVM. 1.0.3
   fixes this problem, fixes the build file, and fixes the output of the
   --version switch, which had been stuck on 1.0rc4 since that release.

  TagSoup 1.0.1 features

   One new user-supplied feature, plus two features or bugfixes,
   depending on how you look at them. None are critical, so you needn't
   update unless you care.

   Previous versions of TagSoup always ignored whitespace in elements
   that don't have PCDATA as a possible child. Now, if you turn on the
   ignorableWhitespaceFeature (or use the --ignorable option), that
   whitespace will be returned to your application through the previously
   unused ContentHandler.ignorableWhitespace callback. This isn't done by
   default for backwards compatibility, and also because HTML is an SGML
   application and SGML parsers routinely dropped such whitespace.

   If you install a LexicalHandler in order to pick up comments and
   DOCTYPE declarations (or use the --lexical option), you may get
   comments or public identifiers that aren't valid XML: in particular,
   comments may contain -- sequences. TagSoup will now insert a space
   into such sequences, as well as immediately after a final - in a
   comment. Likewise, TagSoup will now change all illegal characters in
   public identifiers to spaces. What's more, the --lexical option will
   now cause a DOCTYPE declaration to be output if there is one in the
   input.
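
   The two clean-up rules can be sketched as follows. This is an
   illustration, not the code TagSoup runs: the class LexicalCleaner and
   its method names are invented here, and the set of legal public
   identifier characters is taken from the PubidChar production of the
   XML 1.0 Recommendation.

```java
final class LexicalCleaner {
    // Punctuation allowed in public identifiers by XML 1.0 (PubidChar).
    private static final String PUBID_PUNCT = "-'()+,./:=?;!*#@$_%";

    /** Inserts a space after any hyphen that is followed by another
     *  hyphen or that ends the comment. */
    static String cleanComment(String comment) {
        StringBuilder out = new StringBuilder(comment.length() + 8);
        for (int i = 0; i < comment.length(); i++) {
            char ch = comment.charAt(i);
            out.append(ch);
            boolean last = i == comment.length() - 1;
            if (ch == '-' && (last || comment.charAt(i + 1) == '-')) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    /** Replaces every character that is illegal in a public
     *  identifier with a space. */
    static String cleanPublicId(String id) {
        StringBuilder out = new StringBuilder(id.length());
        for (char ch : id.toCharArray()) {
            boolean legal = ch == ' ' || ch == '\r' || ch == '\n'
                    || (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9')
                    || PUBID_PUNCT.indexOf(ch) >= 0;
            out.append(legal ? ch : ' ');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + cleanComment(" a -- b -") + "]");
        System.out.println("[" + cleanPublicId("-//W3C//DTD <HTML>//EN") + "]");
    }
}
```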

  TagSoup 1.0 Final released

   Another small change: There is a switch --norestart to prevent
   restartable elements from being restarted.

   This is the end of my current plans for TagSoup. I will continue to
   fix bugs, but it now does everything that I foresaw back in 2002 when
   I started this project, and a great deal more. Thanks to everyone on
   the tagsoup-friends mailing list for their efforts.

  What TagSoup does

   TagSoup is designed as a parser, not a whole application; it isn't
   intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
   to parse it on the fly. Therefore, it does not convert presentation
   HTML to CSS or anything similar. It does guarantee well-structured
   results: tags will wind up properly nested, default attributes will
   appear appropriately, and so on.

   The semantics of TagSoup are, as far as practical, those of actual
   HTML browsers. In particular, never, never will it throw any sort of
   syntax error: the TagSoup motto is [9]"Just Keep On Truckin'". But
   there's much, much more. For example, if the first tag is LI, it will
   supply the application with enclosing HTML, BODY, and UL tags. Why
   UL? Because that's what browsers assume in this situation. For the
   same reason, overlapping tags are correctly restarted whenever
   possible. Text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text

   gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
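
   The restart idea can be illustrated with a much-simplified sketch.
   It is not TagSoup's algorithm or schema: the class RestartSketch is
   hypothetical, tags may not carry attributes, only b and i are assumed
   to be restartable, and no enclosing elements are supplied.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RestartSketch {
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");
    // Assumption: only these elements are restartable in this sketch.
    private static final Set<String> RESTARTABLE =
            new HashSet<>(Arrays.asList("b", "i"));

    static String rewrite(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost element first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // text before this tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) { // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) { // end-tag with an open match
                Deque<String> restart = new ArrayDeque<>();
                String top;
                // Close everything nested inside the element being ended.
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    if (RESTARTABLE.contains(top)) {
                        restart.push(top);
                    }
                }
                out.append("</").append(name).append('>');
                // Re-open the interrupted elements, outermost first.
                for (String r : restart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            } // end-tags with no open match are dropped
        }
        out.append(html.substring(pos));
        while (!open.isEmpty()) { // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

   The sketch re-opens interrupted elements immediately after the
   end-tag; a real parser is free to defer that until more content
   arrives.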

   By intention, TagSoup is small and fast. It does not depend on the
   existence of any framework other than SAX, and should be able to work
   with any framework that can accept SAX parsers. In particular, [10]XOM
   is known to work.

   You can replace the low-level HTML scanner with one based on Sean
   McGrath's [11]PYX format (very close to James Clark's ESIS format).
   You can also supply an AutoDetector that peeks at the incoming byte
   stream and guesses a character encoding for it. Otherwise, the
   platform default is used. If you need an autodetector of character
   sets, consider trying to adapt the [12]Mozilla one; if you succeed,
   let me know.
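
   As a starting point, a very small detector could look only at a
   leading byte order mark. The sketch below uses nothing but the JDK;
   the class name BomSniffer and the method signature are assumptions
   made for illustration, so it would still need to be adapted to the
   AutoDetector interface of your TagSoup release.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    /** Returns a Reader whose encoding is guessed from a leading BOM.
     *  The BOM is consumed; without one, the platform default is used. */
    static Reader autoDetectingReader(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pin, head);
        int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF, b2 = head[2] & 0xFF;
        Charset cs = Charset.defaultCharset();
        int bomLength = 0;
        if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && b0 == 0xFE && b1 == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && b0 == 0xFF && b1 == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        if (n > bomLength) { // push back everything that is not BOM
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, cs);
    }

    /** Reads until the buffer is full or the stream ends. */
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        Reader r = autoDetectingReader(new ByteArrayInputStream(data));
        for (int c; (c = r.read()) != -1; ) {
            System.out.print((char) c);
        }
        System.out.println();
    }
}
```

   BOM sniffing covers only documents that carry a BOM; recognizing
   legacy encodings requires a statistical detector.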

  Note: TagSoup in Java 1.1

   If you go through the TagSoup source and replace all references to
   HashMap with Hashtable and recompile, TagSoup will work fine in Java
   1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.

  The TSaxon XSLT-for-HTML processor

   [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
   of Michael Kay's Saxon XSLT version 1.0 implementation that includes
   TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
   process either HTML or XML documents with XSLT stylesheets.

  TagSoup as a stand-alone program

   It is possible to run TagSoup as a program by saying java -jar
   tagsoup-1.0.5.jar [option ...] [file ...]. Files mentioned on the
   command line will be parsed individually. If no files are specified,
   the standard input is read.

   The following options are understood:

   --files
          Output into individual files, with html extensions changed to
          xhtml. Otherwise, all output is sent to the standard output.

   --html
          Output is in clean HTML: the XML declaration is suppressed, as
          are end-tags for the known empty elements.

   --omit-xml-declaration
          The XML declaration is suppressed.

   --method=html
          End-tags for the known empty HTML elements are suppressed.

   --pyx
          Output is in PYX format.

   --pyxin
          Input is in PYXoid format (need not be well-formed).

   --nons
          Namespaces are suppressed. Normally, all elements are in the
          XHTML 1.x namespace, and all attributes are in no namespace.

   --nobogons
          Bogons (unknown elements) are suppressed. Normally, they are
          treated as empty.

   --nodefaults
          Default attribute values are suppressed.

   --nocolons
          Explicit colons in element and attribute names are changed to
          underscores.

   --norestart
          Normally restartable elements are not restarted.

   --ignorable
          Whitespace in elements with element-only content is output.

   --any
          Bogons are given a content model of ANY rather than EMPTY.

   --lexical
          Pass through HTML comments and DOCTYPE declarations. Has no
          effect when output is in PYX format.

   --reuse
          Reuse a single instance of TagSoup parser throughout. Normally,
          a new one is instantiated for each input file.

   --nocdata
          Change the content models of the script and style elements to
          treat them as ordinary #PCDATA (text-only) elements, as in
          XHTML, rather than with the special CDATA content model.

   --encoding=encoding
          Specify the input encoding. The default is the Java platform
          default.

   --output-encoding=encoding
          Specify the output encoding. The default is the Java platform
          default.

   --help
          Print help.

   --version
          Print the version number.

  SAX features and properties

   TagSoup supports the following SAX features in addition to the
   standard ones:

   http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
          A value of "true" indicates that the parser will ignore unknown
          elements.

   http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
          A value of "true" indicates that the parser will give unknown
          elements a content model of EMPTY; a value of "false", a
          content model of ANY.

   http://www.ccil.org/~cowan/tagsoup/features/default-attributes
          A value of "true" indicates that the parser will return default
          attribute values for missing attributes that have default
          values.

   http://www.ccil.org/~cowan/tagsoup/features/translate-colons
          A value of "true" indicates that the parser will translate
          colons into underscores in names.

   http://www.ccil.org/~cowan/tagsoup/features/restart-elements
          A value of "true" indicates that the parser will attempt to
          restart the restartable elements.

   http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
          A value of "true" indicates that the parser will transmit
          whitespace in element-only content via the SAX
          ignorableWhitespace callback. Normally this is not done,
          because HTML is an SGML application and SGML suppresses such
          whitespace.
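
   Since these are ordinary SAX features, they are set through the
   standard XMLReader.setFeature call. The helper below is a sketch: the
   class TagSoupFeatures and its constant names are invented here, and
   only the feature URIs come from the list above. It works with any
   XMLReader and reports which features the reader rejected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

final class TagSoupFeatures {
    static final String BASE = "http://www.ccil.org/~cowan/tagsoup/features/";
    static final String IGNORE_BOGONS = BASE + "ignore-bogons";
    static final String BOGONS_EMPTY = BASE + "bogons-empty";
    static final String DEFAULT_ATTRIBUTES = BASE + "default-attributes";
    static final String TRANSLATE_COLONS = BASE + "translate-colons";
    static final String RESTART_ELEMENTS = BASE + "restart-elements";
    static final String IGNORABLE_WHITESPACE = BASE + "ignorable-whitespace";

    /** Sets each feature in turn and returns the URIs of those the
     *  reader did not recognize or could not support. */
    static List<String> apply(XMLReader reader, Map<String, Boolean> settings) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : settings.entrySet()) {
            try {
                reader.setFeature(e.getKey(), e.getValue());
            } catch (SAXException ex) { // not recognized or not supported
                rejected.add(e.getKey());
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        Map<String, Boolean> settings = new LinkedHashMap<>();
        settings.put(IGNORABLE_WHITESPACE, true);
        settings.put(RESTART_ELEMENTS, false);
        // Stand-in reader from the JDK so the sketch runs without the
        // TagSoup jar; it knows none of these features.
        XMLReader reader = new XMLFilterImpl();
        System.out.println("rejected: " + apply(reader, settings));
    }
}
```

   With the TagSoup jar on the classpath, the stand-in reader would be
   replaced by the TagSoup parser itself (assumed here to be
   org.ccil.cowan.tagsoup.Parser; check the Javadoc of your release),
   and the rejected list should then come back empty.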

   TagSoup supports the following SAX properties in addition to the
   standard ones:

   http://www.ccil.org/~cowan/tagsoup/properties/scanner
          Specifies the Scanner object this parser uses.

   http://www.ccil.org/~cowan/tagsoup/properties/schema
          Specifies the Schema object this parser uses.

   http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
          Specifies the AutoDetector (for encoding detection) this parser
          uses.

  More information

   I gave a presentation (a nocturne, so it's not on the schedule) at
   [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
   presented in 2002 at the New York City XML SIG and at XML 2002. This
   is the main high-level documentation about how TagSoup works. Formats:
   [16]OpenDocument [17]Powerpoint [18]PDF.

   I also had people add [19]"evil" HTML to a large poster so that I
   could [20]clean it up; View Source is probably more useful than
   ordinary browsing. The original instructions were:

                        SOUPE DE BALISES (BE EVIL)!
   Ecritez une balise ouvrante (sans attributs)
   ou fermante HTML ici, s.v.p.

   (In English: "Tag soup (be evil)! Please write an HTML start-tag
   (without attributes) or end-tag here.")

   There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo
   Groups. You can [23]join via the Web, or by sending a blank email to
   [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
   open to all.

   Online TagSoup processing for publicly accessible HTML documents is
   now [26]available courtesy of Leigh Dodds.

References

   1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
   2. http://www.opensource.org/licenses/afl-3.0.php
   3. http://www.opensource.org/licenses/gpl-license.php
   4. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
   5. http://home.ccil.org/~cowan/XML/tagsoup/#properties
   6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0.5.jar
   7. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0.5-src.zip
   8. http://tidy.sf.net/
   9. http://www.crumbmuseum.com/truckin.html
  10. http://www.cafeconleche.org/XOM
  11. http://gnosis.cx/publish/programming/xml_matters_17.html
  12. http://jchardet.sourceforge.net/
  13. http://www.ccil.org/~cowan
  14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
  15. http://www.extrememarkup.com/extreme/2004
  16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
  17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
  18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
  19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
  20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
  21. http://groups.yahoo.com/group/tagsoup-friends
  22. http://groups.yahoo.com/
  23. http://groups.yahoo.com/group/tagsoup-friends/join
  24. mailto:tagsoup-friends-subscribe@yahoogroups.com
  25. http://groups.yahoo.com/group/tagsoup-friends/messages
  26. http://xmlarmyknife.org/docs/xhtml/tagsoup/
