'\" t
.tr \(ts"
.ds S \s-1SGML\s0
.de TS
.br
.sp .5
..
.de TE
.br
.sp .5
..
.de TQ
.br
.ns
.TP \\$1
..
.TH NSGMLS 1
.SH NAME
nsgmls \- a validating SGML parser
.sp
An \*S System Conforming to
.if n .br
International Standard ISO 8879 \(em
.br
Standard Generalized Markup Language
.SH SYNOPSIS
.B nsgmls
[
.B \-deglprsuv
]
[
.BI \-a linktype
]
[
.BI \-i name
]
[
.BI \-m file
]
[
.I filename\|.\|.\|.
]
.SH DESCRIPTION
.I Nsgmls
parses and validates
the \*S document entity in
.I filename\|.\|.\|.
and prints on the standard output a simple text representation of its
Element Structure Information Set.
(This is the information set which a structure-controlled
conforming \*S application should act upon.)
Note that the document entity may be spread amongst several files;
for example, the SGML declaration, document type declaration and document
instance set could each be in a separate file.
If no filenames are specified, then
.I nsgmls
will read the document entity from the standard input.
Each filename is actually interpreted as a system identifier.
A command line filename of
.B \-
can be used to refer to the standard input.
(Normally in a system identifier,
.B fd:0
is used to refer to standard input.)
.LP
The following options are available:
.TP
.BI \-a linktype
Make link type
.I linktype
active.
Not all ESIS information is output in this case:
the active LPDs are not explicitly reported,
although each link attribute is qualified with
its link type name;
there is no information about result elements;
when there are multiple link rules applicable to the
current element,
.I nsgmls
always chooses the first.
.TP
.B \-d
Warn about duplicate entity declarations.
.TP
.B \-e
Describe open entities in error messages.
Error messages always include the position of the most recently
opened external entity.
.TP
.B \-g
Show the \s-1GI\s0s of open elements in error messages.
.TP
.BI \-i name
Pretend that
.RS
.IP
.BI <!ENTITY\ %\  name\  \(tsINCLUDE\(ts>
.LP
occurs at the start of the document type declaration subset
in the \*S document entity.
Since repeated definitions of an entity are ignored,
this definition will take precedence over any other definitions
of this entity in the document type declaration.
Multiple
.B \-i
options are allowed.
If the \*S declaration replaces the reserved name
.B INCLUDE
then the new reserved name will be the replacement text of the entity.
Typically the document type declaration will contain
.IP
.BI <!ENTITY\ %\  name\  \(tsIGNORE\(ts>
.LP
and will use
.BI % name ;
in the status keyword specification of a marked section declaration.
In this case the effect of the option will be to cause the marked
section not to be ignored.
.RE
.TP
.B \-l
Output
.B L
commands giving the current line number and filename.
.TP
.BI \-m file
Map public identifiers and entity names to system identifiers
using the catalog entry file whose system identifier is
.IR file .
Multiple
.B \-m
options are allowed.
Catalog entry files specified with the
.B \-m
option will be searched before the defaults.
.TP
.B \-p
Parse only the prolog.
.I Nsgmls
will exit after parsing the document type declaration.
Implies
.BR \-s .
.TP
.B \-r
Warn about defaulted references.
.TP
.B \-s
Suppress output.
Error messages will still be printed.
.TP
.B \-u
Warn about undefined elements: elements used in the DTD but not defined.
.TP
.B \-v
Print the version number.
.SS "External entities"
An external entity resides in one or more storage objects,
each of which contains a sequence of bytes.
The entity manager component of
.I nsgmls
maps a sequence of storage objects into an entity as follows:
.IP 1.
The bytes in each storage object are converted into characters,
according to the coding system associated with the storage object.
.IP 2.
The characters in each storage object are concatenated.
.IP 3.
The sequence of characters is treated as a sequence of lines each
terminated by a newline character.  A record start is inserted at the
beginning of each line, and a record end at the end of each line.  If
there is a partial line (a line that ends with a character other than
a newline) at the end of the file, then a record start will be
inserted before it but no record end will be inserted after it.
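.LP
As a rough illustration, the three steps above can be sketched in Python.
The names
.B storage_objects_to_entity
and
.B identity_decode
are hypothetical, the strings
.B <RS>
and
.B <RE>
merely stand in for the record start and record end characters,
and the sketch assumes that each newline is replaced by the record end
rather than kept alongside it.
.nf
```python
# Symbolic stand-ins for the record start and record end characters.
RS, RE = "<RS>", "<RE>"
NEWLINE = chr(10)

def identity_decode(data):
    # Identity coding system: each byte becomes the character with that code.
    return "".join(chr(b) for b in data)

def storage_objects_to_entity(objects, decode=identity_decode):
    # Steps 1 and 2: decode every storage object, then concatenate.
    text = "".join(decode(obj) for obj in objects)
    # Step 3: split into lines and mark the record boundaries.
    lines = text.split(NEWLINE)
    last = lines.pop()  # whatever follows the final newline (may be empty)
    out = [RS + line + RE for line in lines]
    if last:
        # Partial final line: record start, but no record end.
        out.append(RS + last)
    return "".join(out)
```
.fi
.LP
Note that in this sketch a line may span two storage objects,
because the record boundaries are found only after concatenation.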
.SS "System identifiers"
A system identifier describes a sequence of storage objects,
each optionally associated with a coding system.
.I Nsgmls
will attempt to interpret a system identifier
as a keyword followed by a
colon followed by a string, which is interpreted in a keyword-dependent
way.
Keywords are case-insensitive.
The following keywords are recognized:
.TP
.B file
The string is interpreted as a filename.  The system identifier
describes a single storage object that will be read from the named
file.
.TP
.B fd
The string is interpreted as a number.  The system identifier describes
a single storage object that will be read from the
file descriptor with that number.  For example,
.B fd:0
will read the storage object from standard input.
.TP
.B concat
The string is treated as a list of substrings separated by
.B +
characters.
Each of the substrings is in turn interpreted as a system identifier,
and the sequences of storage objects that each denote are
concatenated.
The
.B concat
system identifier describes the resulting sequence of storage objects.
.TP
.B utf8
The string is interpreted as a system identifier.  Each storage object
that it describes that is not associated with a coding system is
associated with the UTF8 coding system.
Invalid multi-byte sequences are represented by the character 0xFFFD.
This keyword is recognized only in the multi-byte version of
.IR nsgmls .
.TP
.B ucs2
The string is interpreted as a system identifier.  Each storage object
that it describes that is not associated with a coding system is
associated with the UCS2 coding system.
The more significant octet of each character always
precedes the less significant octet irrespective of the system's
native byte-order.
The codes 0xFFFE and 0xFEFF are not treated specially in any way.
This keyword is recognized only in the multi-byte version of
.IR nsgmls .
.TP
.B unicode
The string is interpreted as a system identifier.  Each storage object
that it describes that is not associated with a coding system is
associated with the Unicode coding system.
The Unicode coding system treats each pair of octets
as a character in the system's byte order.
If the first character is the byte order mark character (0xFEFF),
it will be discarded.
If the first character is the byte order mark character byte-swapped,
it will be discarded and the remaining characters will be byte-swapped.
This keyword is recognized only in the multi-byte version of
.IR nsgmls .
.TP
.B ujis
The string is interpreted as a system identifier.  Each storage object
that it describes that is not associated with a coding system is
associated with the UJIS (EUC) coding system.
The UJIS coding system encodes 4 graphic character sets using between
1 and 3 bytes for each character.
These 4 character sets are effectively combined into a single
character set by giving each character a unique code that depends on
which of the 4 character sets the character belongs to,
and the code of the character in that character set.
The code of characters
in the G0 set (usually the Japanese version of ISO 646) is unchanged.
The code of characters in the G1 set (usually JIS X 0208-1990) is ORed
with 0x8080.  The code of characters in the G2 set (usually half-width
katakana from JIS X 0201-1986) is ORed with 0x0080.  The code of
characters in the G3 set (JIS X 0212-1990) is ORed with 0x8000.
This keyword is recognized only in the multi-byte version of
.IR nsgmls .
.TP
.B sjis
The string is interpreted as a system identifier.  Each storage object
that it describes that is not associated with a coding system is
associated with the Shift-JIS coding system.
Characters will have the same codes that they do with the
.B ujis
coding system
(except for characters in the G3 set, which are not representable
using Shift-JIS).
This keyword is recognized only in the multi-byte version of
.IR nsgmls .
.TP
.B identity
The string is interpreted as a system identifier.  Each storage object
that it describes that is not associated with a coding system is
associated with the identity coding system.  The identity coding
system converts bytes to characters by zero-extending each byte.
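.LP
The way the
.B ujis
coding system combines its four character sets into unique codes
can be illustrated with a short Python sketch.
The names
.B UJIS_MASKS
and
.B ujis_code
are hypothetical, and the sketch covers only the arithmetic described above,
not the decoding of bytes.
.nf
```python
# Value ORed into a character's code, keyed by the set it belongs to.
UJIS_MASKS = {"G0": 0x0000, "G1": 0x8080, "G2": 0x0080, "G3": 0x8000}

def ujis_code(charset, code):
    # charset is one of "G0".."G3"; code is the character's code
    # within that character set.
    return code | UJIS_MASKS[charset]
```
.fi
.LP
For instance, under these assumptions the code 0x3021 becomes 0xB0A1 in the
G1 set and 0xB021 in the G3 set, so the two characters remain distinct.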
.LP
If a system identifier does not contain a keyword
or uses a keyword that is not recognized, then
the system identifier will be treated as a filename.
Note that the system identifier
.B file:utf8:doc.sgm
identifies the file named
.B utf8:doc.sgm
but
.B utf8:file:doc.sgm
identifies the file named
.B doc.sgm
using the
.B utf8
coding system.
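.LP
The keyword rules can be made concrete with a small Python sketch.
The function
.B parse_sysid
and its return value (a kind, a value and a coding system name)
are assumptions made for illustration;
the
.B concat
keyword is left out for brevity,
and an
.B fd
string that is not a number is not handled.
.nf
```python
CODING_KEYWORDS = {"utf8", "ucs2", "unicode", "ujis", "sjis", "identity"}

def parse_sysid(sysid):
    # Returns (kind, value, coding); coding is None when unspecified.
    keyword, sep, rest = sysid.partition(":")
    key = keyword.lower()  # keywords are case-insensitive
    if sep and key == "file":
        return ("file", rest, None)
    if sep and key == "fd":
        return ("fd", int(rest), None)
    if sep and key in CODING_KEYWORDS:
        kind, value, coding = parse_sysid(rest)
        # Only a storage object without a coding system receives this one.
        return (kind, value, coding or key)
    # No keyword, or an unrecognized one: the whole string is a filename.
    return ("file", sysid, None)
```
.fi
.LP
With this sketch,
.B file:utf8:doc.sgm
yields the filename
.B utf8:doc.sgm
with no coding system, whereas
.B utf8:file:doc.sgm
yields the filename
.B doc.sgm
with the
.B utf8
coding system.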
.LP
A relative filename in a system identifier
is interpreted relative to the file in which the system
identifier is specified, if any, and otherwise relative
to the current directory.
This applies both to system identifiers specified in SGML documents,
and to system identifiers specified in catalog entry files.
.LP
If a system identifier does not specify the coding system,
the coding system of the storage object in which the
system identifier was specified will be used.
.SS "System identifier generation"
If a system identifier is not specified,
then the entity manager will attempt to generate one using catalog
entry files in the format defined in the SGML Open Draft Technical
Resolution on Entity Management.  A catalog entry file contains a
sequence of entries in one of the following five forms:
.TP
.BI PUBLIC\  pubid\ sysid
This specifies that
.I sysid
should be used as the system identifier if the public
identifier is
.IR pubid .
.I Sysid
is a system identifier as defined in ISO 8879 and
.I pubid
is a public identifier as defined in ISO 8879.
.TP
.BI ENTITY\  name\ sysid
This specifies that
.I sysid
should be used as the system identifier if the entity is a general
entity whose name is
.IR name .
.TP
.BI ENTITY\ % name\ sysid
This specifies that
.I sysid
should be used as the system identifier if the entity is a parameter
entity whose name is
.IR name .
Note that there is no space between the
.B %
and the
.IR name .
.TP
.BI DOCTYPE\  name\ sysid
This specifies that
.I sysid
should be used as the system identifier if the entity is an
entity declared in a document type declaration whose document type name is
.IR name .
.TP
.BI LINKTYPE\  name\ sysid
This specifies that
.I sysid
should be used as the system identifier if the entity is an
entity declared in a link type declaration whose link type name is
.IR name .
.LP
The last two forms are extensions to the SGML Open format.
The delimiters can be omitted from the
.I sysid
provided it does not contain any white space.
Comments are allowed between parameters delimited by
.B \-\-
as in SGML.
The environment variable
.B \s-1SGML_CATALOG_FILES\s0
contains a colon-separated list of catalog entry files.
These will be searched after any catalog entry files specified
using the
.B \-m
option.
If this environment variable is not set,
then a system dependent list of catalog entry files will be used.
A match in a catalog entry file for a PUBLIC entry will take
precedence over a match in the same file for an ENTITY,
DOCTYPE or LINKTYPE entry.
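.LP
The search order can be sketched in Python as follows.
The function
.B lookup
and the tuple representation of catalog entries are hypothetical.
The sketch also assumes that the first file yielding any match ends the
search and that the first matching entry within a file wins;
neither point is stated above.
.nf
```python
def lookup(catalogs, pubid=None, kind=None, name=None):
    # catalogs: one list of entries per catalog entry file, in search
    # order (files given with the -m option first).
    # Each entry is a tuple (keyword, key, sysid).
    for entries in catalogs:
        # Within one file a PUBLIC match takes precedence.
        if pubid is not None:
            for keyword, key, sysid in entries:
                if keyword == "PUBLIC" and key == pubid:
                    return sysid
        for keyword, key, sysid in entries:
            if keyword == kind and key == name:
                return sysid
    return None  # no system identifier could be generated
```
.fi
.LP
Here
.B kind
would be one of ENTITY, DOCTYPE or LINKTYPE, and a parameter entity would be
looked up under its name prefixed with
.BR % .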
.br
.ne 18
.SS "System declaration"
The system declaration for
.I nsgmls
is as follows:
.LP
.TS
tab(&);
c1 s1 s1 s1 s1 s1 s1 s1 s
c s s s s s s s s
l l s s s s s s s
l l s s s s s s s
l l s s s s s s s
l l l s s s s s s
c s s s s s s s s
l l l l l l l l l
l l l l l l l l l
l l l l l l l l l
l l s s s s s s s
l l l s s s s s s
l l l s s s s s s
c s s s s s s s s
l l l l l l l l l.
SYSTEM "ISO 8879:1986"
CHARSET
BASESET&"ISO 646-1983//CHARSET
&\h'\w'"'u'International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET&0\0128\00
CAPACITY&PUBLIC&"ISO 8879:1986//CAPACITY Reference//EN"
FEATURES
MINIMIZE&DATATAG&NO&OMITTAG&YES&RANK&YES&SHORTTAG&YES
LINK&SIMPLE&YES 65536&IMPLICIT&YES&EXPLICIT&YES 1
OTHER&CONCUR&NO&SUBDOC&YES 100&FORMAL&YES
SCOPE&DOCUMENT
SYNTAX&PUBLIC&"ISO 8879:1986//SYNTAX Reference//EN"
SYNTAX&PUBLIC&"ISO 8879:1986//SYNTAX Core//EN"
VALIDATE
&GENERAL&YES&MODEL&YES&EXCLUDE&YES&CAPACITY&NO
&NONSGML&YES&SGML&YES&FORMAL&YES
.T&
c s s s s s s s s
l l l l l l l l l.
SDIF
&PACK&NO&UNPACK&NO
.TE
.LP
The limit for the \s-1SUBDOC\s0 parameter is memory dependent.
.LP
Any legal concrete syntax may be used.
.SS "\*S declaration"
If the \*S declaration is omitted,
the following declaration will be implied:
.TS
tab(&);
c1 s1 s1 s1 s1 s1 s1 s1 s
c s s s s s s s s
l l s s s s s s s.
<!SGML "ISO 8879:1986"
CHARSET
BASESET&"ISO 646-1983//CHARSET
&\h'\w'"'u'International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET&\0\00\0\09\0UNUSED
&\0\09\0\02\0\09
&\011\0\02\0UNUSED
&\013\0\01\013
&\014\018\0UNUSED
&\032\095\032
&127\0\01\0UNUSED
.T&
l l l s s s s s s
l l s s s s s s s
l l l s s s s s s
c s s s s s s s s
l l l l l l l l l.
CAPACITY&PUBLIC&"ISO 8879:1986//CAPACITY Reference//EN"
SCOPE&DOCUMENT
SYNTAX&PUBLIC&"ISO 8879:1986//SYNTAX Reference//EN"
FEATURES
MINIMIZE&DATATAG&NO&OMITTAG&YES&RANK&NO&SHORTTAG&YES
LINK&SIMPLE&NO&IMPLICIT&NO&EXPLICIT&NO
OTHER&CONCUR&NO&SUBDOC&YES 99999999&FORMAL&YES
.T&
c s s s s s s s s.
APPINFO NONE>
.TE
with the exception that characters 128 through 254 will be assigned to
\s-1DATACHAR\s0.
.LP
.I Nsgmls
identifies base character sets using the designating sequence in the
public identifier.  The following designating sequences are
recognized:
.TS
tab(&);
c c c c c
c c c c ^
c c c c ^
l n n n l.
Designating&ISO&Minimum&Number&Description
Escape&Registration&Character&of&
Sequence&Number&Number&Characters&
_
ESC 2/5 4/0&-&0&128&full set of ISO 646 IRV
ESC 2/8 4/0&2&0&128&G0 set of ISO 646 IRV
ESC 2/8 4/2&6&0&128&G0 set of ASCII
ESC 2/13 4/1&100&0&128&G1 set of ISO 8859-1
ESC 2/1 4/0&1&0&32&C0 set of ISO 646
ESC 2/2 4/3&77&0&32&C1 set of ISO 6429
ESC 2/5 2/15 4/0&162&0&65536&ISO 10646 UCS-2 level 1
ESC 2/5 2/15 4/3&174&0&65536&ISO 10646 UCS-2 level 2
ESC 2/5 2/15 4/5&176&0&65536&ISO 10646 UCS-2 level 3
ESC 2/5 2/15 4/1&163&0&2147483648&ISO 10646 UCS-4 level 1
ESC 2/5 2/15 4/4&175&0&2147483648&ISO 10646 UCS-4 level 2
ESC 2/5 2/15 4/6&177&0&2147483648&ISO 10646 UCS-4 level 3
.TE
.LP
The graphic character sets do not strictly include
C0 and C1 control character sets.
For convenience,
.I nsgmls
augments the graphic character sets with the appropriate
control character sets.
.SS "Output format"
The output is a series of lines.
Lines can be arbitrarily long.
Each line consists of an initial command character
and one or more arguments.
Arguments are separated by a single space,
but when a command takes a fixed number of arguments
the last argument can contain spaces.
There is no space between the command character and the first argument.
Arguments can contain the following escape sequences.
.TP
.B \e\e
A
.BR \e .
.TP
.B \en
A record end character.
.TP
.B \e|
Internal \s-1SDATA\s0 entities are bracketed by these.
.TP
.BI \e nnn
The character whose code is
.I nnn
octal.
.LP
A record start character will be represented by
.BR \e012 .
Most applications will need to ignore
.B \e012
and translate
.B \en
into newline.
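.LP
A reader of this format might decode an argument along the following lines.
This Python sketch is illustrative only: the function
.B decode_arg
is hypothetical, it assumes that a numeric escape always has exactly three
octal digits, and it simply drops the SDATA brackets instead of reporting
them.
.nf
```python
BACKSLASH = chr(92)  # built with chr() so no literal backslash is needed
NEWLINE = chr(10)
OCTAL_DIGITS = "01234567"

def decode_arg(arg):
    # Resolve the escapes in one argument: record starts are dropped,
    # record ends become newlines, SDATA brackets are removed.
    out = []
    i = 0
    while i < len(arg):
        rest = arg[i + 1:i + 4]
        if arg[i] != BACKSLASH or not rest:
            out.append(arg[i])
            i += 1
        elif rest[0] == BACKSLASH:
            out.append(BACKSLASH)
            i += 2
        elif rest[0] == "n":
            out.append(NEWLINE)
            i += 2
        elif rest[0] == "|":
            i += 2  # SDATA bracket: simply dropped in this sketch
        elif len(rest) == 3 and all(ch in OCTAL_DIGITS for ch in rest):
            code = int(rest, 8)
            if code != 0o12:  # octal 012 is the record start: ignored
                out.append(chr(code))
            i += 4
        else:
            out.append(arg[i])  # unknown escape: kept as is (assumption)
            i += 1
    return "".join(out)
```
.fi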
.LP
The possible command characters and arguments are as follows:
.TP
.BI ( gi
The start of an element whose generic identifier is
.IR gi .
Any attributes for this element
will have been specified with
.B A
commands.
.TP
.BI ) gi
The end of an element whose generic identifier is
.IR gi .
.TP
.BI \- data
Data.
.TP
.BI & name
A reference to an external data entity
.IR name ;
.I name
will have been defined using an
.B E
command.
.TP
.BI ? pi
A processing instruction with data
.IR pi .
.TP
.BI A name\ val
The next element to start has an attribute
.I name
with value
.I val
which takes one of the following forms:
.RS
.TP
.B IMPLIED
The value of the attribute is implied.
.TP
.BI CDATA\  data
The attribute is character data.
This is used for attributes whose declared value is
.BR \s-1CDATA\s0 .
.TP
.BI NOTATION\  nname
The attribute is a notation name;
.I nname
will have been defined using an
.B N
command.
This is used for attributes whose declared value is
.BR \s-1NOTATION\s0 .
.TP
.BI ENTITY\  name\|.\|.\|.
The attribute is a list of general entity names.
Each entity name will have been defined using an
.BR I ,
.B E
or
.B S
command.
This is used for attributes whose declared value is
.B \s-1ENTITY\s0
or
.BR \s-1ENTITIES\s0 .
.TP
.BI TOKEN\  token\|.\|.\|.
The attribute is a list of tokens.
This is used for attributes whose declared value is anything else.
.RE
.TP
.BI D ename\ name\ val
This is the same as the
.B A
command, except that it specifies a data attribute for an
external entity named
.IR ename .
Any
.B D
commands will come after the
.B E
command that defines the entity to which they apply, but
before any
.B &
or
.B A
commands that reference the entity.
.TP
.BI a type\ name\ val
The next element to start has a link attribute with link type
.IR type ,
name
.IR name ,
and value
.IR val ,
which takes the same form as with the
.B A
command.
.TP
.BI N nname
Define a notation
.IR nname .
This command will be preceded by a
.B p
command if the notation was declared with a public identifier,
and by an
.B s
command if the notation was declared with a system identifier.
A notation will only be defined if it is to be referenced
in an
.B E
command or in an
.B A
command for an attribute with a declared value of
.BR \s-1NOTATION\s0 .
.TP
.BI E ename\ typ\ nname
Define an external data entity named
.I ename
with type
.I typ
.RB ( \s-1CDATA\s0 ,
.B \s-1NDATA\s0
or
.BR \s-1SDATA\s0 )
and notation
.IR nname .
This command will be preceded by one or more
.B f
commands giving the filenames generated by the entity manager from the system
and public identifiers,
by a
.B p
command if a public identifier was declared for the entity,
and by an
.B s
command if a system identifier was declared for the entity.
.I nname
will have been defined using an
.B N
command.
Data attributes may be specified for the entity using
.B D
commands.
An external data entity will only be defined if it is to be referenced in a
.B &
command or in an
.B A
command for an attribute whose declared value is
.B \s-1ENTITY\s0
or
.BR \s-1ENTITIES\s0 .
.TP
.BI I ename\ typ\ text
Define an internal data entity named
.I ename
with type
.I typ
.RB ( \s-1CDATA\s0
or
.BR \s-1SDATA\s0 )
and entity text
.IR text .
An internal data entity will only be defined if it is referenced in an
.B A
command for an attribute whose declared value is
.B \s-1ENTITY\s0
or
.BR \s-1ENTITIES\s0 .
.TP
.BI S ename
Define a subdocument entity named
.IR ename .
This command will be preceded by one or more
.B f
commands giving the filenames generated by the entity manager from the system
and public identifiers,
by a
.B p
command if a public identifier was declared for the entity,
and by an
.B s
command if a system identifier was declared for the entity.
A subdocument entity will only be defined if it is referenced
in a
.B {
command
or in an
.B A
command for an attribute whose declared value is
.B \s-1ENTITY\s0
or
.BR \s-1ENTITIES\s0 .
.TP
.BI s sysid
This command applies to the next
.BR E ,
.B S
or
.B N
command and specifies the associated system identifier.
.TP
.BI p pubid
This command applies to the next
.BR E ,
.B S
or
.B N
command and specifies the associated public identifier.
.TP
.BI f filename
This command applies to the next
.B E
or
.B S
command and specifies an associated filename.
There will be more than one
.B f
command for a single
.B E
or
.B S
command if the system identifier used a
.if \n(Os=0 colon.
.if \n(Os=1 semi-colon.
.TP
.BI { ename
The start of the \*S subdocument entity
.IR ename ;
.I ename
will have been defined using an
.B S
command.
.TP
.BI } ename
The end of the \*S subdocument entity
.IR ename .
.TP
.BI L lineno\ file
.TQ
.BI L lineno
Set the current line number and filename.
The
.I file
argument will be omitted if only the line number has changed.
This will be output only if the
.B \-l
option has been given.
.TP
.BI # text
An \s-1APPINFO\s0 parameter of
.I text
was specified in the \*S declaration.
This is not strictly part of the ESIS, but a structure-controlled
application is permitted to act on it.
No
.B #
command will be output if
.B \s-1APPINFO\s0\ \s-1NONE\s0
was specified.
A
.B #
command will occur at most once,
and may be preceded only by a single
.B L
command.
.TP
.B C
This command indicates that the document was a conforming \*S document.
If this command is output, it will be the last command.
An \*S document is not conforming if it references a subdocument entity
that is not conforming.
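.LP
As an example of consuming this output, the Python sketch below pairs each
element start with the
.B A
commands that preceded it.
The generator
.B elements
is hypothetical, attribute values are kept in their undecoded form
(for example
.BR "CDATA a b" ),
and all other commands are ignored.
.nf
```python
def elements(lines):
    # Yields (gi, attributes) for every element start, where attributes
    # maps each attribute name to its value form.
    pending = {}
    for line in lines:
        command, rest = line[:1], line[1:]
        if command == "A":
            # The name comes first; the value is the last argument and
            # may itself contain spaces.
            name, _, val = rest.partition(" ")
            pending[name] = val
        elif command == "(":
            yield rest, pending
            pending = {}
```
.fi
.LP
Given the invented input lines
.BR "ATYPE CDATA a b" ,
.B (DOC
and
.BR (P ,
the sketch yields DOC with the attribute TYPE followed by P with no attributes.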
.SH ENVIRONMENT
.TP
.SM
.B NSGMLS_CODE
If this is set to the name of a coding system, then that coding system
will be used as the default coding system for everything (including
file input, file output, message output, filenames and command line
arguments).  Otherwise the
.B identity
coding system will be used.
Setting this to
.B ucs2
or
.B unicode
is unlikely to give reasonable results.
.SH "SEE ALSO"
The \*S Handbook, Charles F. Goldfarb
.br
\s-1ISO\s0 8879 (Standard Generalized Markup Language),
International Organization for Standardization
.SH BUGS
.LP
Not all ESIS information for LINK is reported.
.SH AUTHOR
.LP
James Clark (jjc@jclark.com).
