TAGSOUP(1)                       User Commands                      TAGSOUP(1)

NAME
       TagSoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS
       java -jar tagsoup-1.0.4 [ options ] [ files ]

DESCRIPTION
       Rectify arbitrary HTML into clean XHTML, using a tailored description
       of HTML.  The output will be well-formed XML, but not necessarily valid
       XHTML.

       --files
              process multiple input files into corresponding output files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name
              begins with "utf", the output will not contain character
              entities; otherwise, all non-ASCII characters are represented
              as entities)

       --html output rectified HTML rather than XML, omitting the XML
              declaration and any namespace declarations

       --method=html
              output rectified HTML rather than XML (end-tags are omitted for
              empty elements, and no character escaping is done in script and
              style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and any DOCTYPE
              declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change explicit colons in element and attribute names to
              underscores

       --norestart
              don't restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace in element-only
              content) via the SAX handler method ignorableWhitespace

       --any  treat unknown non-HTML elements as allowing any content

       --nocdata
              treat the CDATA-content elements script and style as ordinary
              elements (mostly for testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number

       TagSoup is a parser and reformatter for nasty, ugly HTML.  Its normal
       processing mode is to accept HTML files on the command line, or from
       the standard input if none are given, and output them as clean XML to
       the standard output.  The encoding is assumed to be the platform-local
       encoding on input, and is always UTF-8 on output.
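
       The normal processing mode described above might be driven from a
       shell as in the following sketch.  The TAGSOUP_JAR variable, the
       tagsoup wrapper function and the file names in the comments are
       illustrative assumptions, not part of TagSoup; only the options
       themselves are taken from this page.

```shell
# Hypothetical wrapper: TAGSOUP_JAR and the function name are assumptions.
# Point TAGSOUP_JAR at wherever the TagSoup jar actually lives.
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.0.4}"

tagsoup() {
    java -jar "$TAGSOUP_JAR" "$@"
}

# Example calls (left as comments; the file names are placeholders):
#   tagsoup messy.html > clean.xhtml           # named file in, XML on stdout
#   tagsoup < messy.html > clean.xhtml         # no file given: read stdin
#   tagsoup --encoding=ISO-8859-1 messy.html   # declare the input encoding
#   tagsoup --html messy.html > clean.html     # rectified HTML, not XML
#   tagsoup --files a.html b.html              # one output file per input
```

       Wrapping the java invocation in a function keeps the jar location in
       one place, so the option examples stay short.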

       When the --files option is given, each input file is processed into an
       output file of the corresponding name, with the extension changed to
       xhtml.  If the extension is already xhtml, it is changed to xhtml_.

       TagSoup will repair, by whatever means necessary, violations of XML
       well-formedness.  In particular, it will fix up malformed attribute
       names and supply missing attribute-value quotation marks.  More
       significantly, it supplies end-tags where HTML allows them to be
       omitted, and sometimes where it doesn't.  It will even supply
       start-tags where necessary; for example, if a document begins with a
       <li> tag, TagSoup will automatically prefix it with <html><body><ul>.
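
       The output naming rule given above for --files can be sketched as a
       small shell function, which may help when a script needs to know
       which file TagSoup is expected to write.  The function name is
       hypothetical and the code is not taken from TagSoup.  It assumes a
       plain file name with no dots in any directory part, and it assumes
       that a name with no extension simply gains .xhtml, which this page
       does not state.

```shell
# Hypothetical helper (not part of TagSoup): predict the output file name
# that the --files rule on this page describes for a given input name.
tagsoup_output_name() {
    name=$1
    case $name in
        *.xhtml)
            # Extension is already xhtml: it becomes xhtml_.
            printf '%s_\n' "$name"
            ;;
        *.*)
            # Any other extension is replaced by xhtml.
            printf '%s.xhtml\n' "${name%.*}"
            ;;
        *)
            # Assumption: no extension at all, so append one.
            printf '%s.xhtml\n' "$name"
            ;;
    esac
}
```

       For instance, tagsoup_output_name index.html prints index.xhtml, and
       tagsoup_output_name page.xhtml prints page.xhtml_.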


BUGS
       TagSoup can be fooled by missing close quotes after attribute values,
       and by incorrect character encodings (it does not contain an encoding
       guesser).

       TagSoup doesn't understand namespace declarations, which are not
       properly part of HTML.  Instead, any element or attribute name
       beginning with foo: will be put into the artificial namespace
       urn:x-prefix:foo.

       For the same reasons, namespace-qualified attributes like xml:space
       can't be returned as default values, though an explicit attribute in
       the xml namespace will be returned with the proper namespace URI.
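
       The prefix handling described above can be sketched as a lookup from
       a qualified name to a namespace name.  The function below is a
       hypothetical illustration, not TagSoup code.  It applies the
       urn:x-prefix: rule to any prefixed name and treats the xml prefix as
       the one exception, using the standard XML namespace URI.  This page
       states that exception only for attributes, so extending it to every
       name is an assumption of the sketch.

```shell
# Hypothetical helper (not part of TagSoup): print the namespace name that
# this page describes for a prefixed element or attribute name.
tagsoup_prefix_namespace() {
    case $1 in
        xml:*)
            # The xml prefix is bound to the standard XML namespace.
            printf '%s\n' 'http://www.w3.org/XML/1998/namespace'
            ;;
        *:*)
            # Everything before the first colon is treated as the prefix.
            printf 'urn:x-prefix:%s\n' "${1%%:*}"
            ;;
        *)
            # No prefix, so the rule does not apply.
            return 1
            ;;
    esac
}
```

       A consumer of TagSoup output could use such a mapping to recognise
       which artificial namespace a prefixed name from the source HTML ended
       up in.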

AUTHOR
       John Cowan <cowan@ccil.org>

COPYRIGHT
       Copyright 2007 John Cowan
       This is free software; see the source for copying conditions.  There is
       NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.



TagSoup 1.1                       March 2007                        TAGSOUP(1)
