Requirements For The SQLite Tokenizer

When processing SQL statements, SQLite (as does every other SQL database engine) breaks the SQL statement up into tokens which are then forwarded to the parser component. SQL statements are split into tokens by the "tokenizer" component of SQLite. This document specifies requirements that precisely define the operation of the SQLite tokenizer.

Character classes

SQL statements are composed of Unicode characters. Specific individual characters may be described using a notation consisting of the character "u" followed by four hexadecimal digits. For example, the lower-case letter "a" can be expressed as "u0061" and the dollar sign can be expressed as "u0024". For notational convenience, the following character classes are defined:

WHITESPACE

One of these five characters: u0009, u000a, u000c, u000d, or u0020

ALPHABETIC

Any of the characters in the range u0041 through u005a (letters "A" through "Z") or in the range u0061 through u007a (letters "a" through "z") or the character u005f ("_") or any other character larger than u007f.

NUMERIC

Any of the characters in the range u0030 through u0039 (digits "0" through "9")

ALPHANUMERIC

Any character which is either ALPHABETIC or NUMERIC

HEXADECIMAL

Any NUMERIC character or a character in the range u0041 through u0046 ("A" through "F") or in the range u0061 through u0066 ("a" through "f")

SPECIAL

Any character that is not WHITESPACE, ALPHABETIC, or NUMERIC
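
The character classes above can be sketched as simple predicates. This is only an illustration in Python, not SQLite's implementation; the function names are hypothetical and each function assumes it is given a single character.

```python
# Illustrative predicates for the character classes defined above.
# The function names are hypothetical; each expects a one-character string.

def is_whitespace(ch: str) -> bool:
    # u0009, u000a, u000c, u000d, u0020
    return ch in "\t\n\f\r "

def is_alphabetic(ch: str) -> bool:
    # A-Z, a-z, underscore, or any character larger than u007f
    return "A" <= ch <= "Z" or "a" <= ch <= "z" or ch == "_" or ord(ch) > 0x7F

def is_numeric(ch: str) -> bool:
    return "0" <= ch <= "9"

def is_alphanumeric(ch: str) -> bool:
    return is_alphabetic(ch) or is_numeric(ch)

def is_hexadecimal(ch: str) -> bool:
    return is_numeric(ch) or "A" <= ch <= "F" or "a" <= ch <= "f"

def is_special(ch: str) -> bool:
    return not (is_whitespace(ch) or is_alphabetic(ch) or is_numeric(ch))
```

Note that under these definitions the vertical tab (u000b) is not WHITESPACE and the dollar sign is SPECIAL, not ALPHABETIC.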

Token requirements

Processing is left-to-right. This seems obvious, but it needs to be explicitly stated.

H41010: SQLite shall divide input SQL text into tokens working from left to right.

The standard practice in SQL, as with most context-free grammar based programming languages, is to resolve ambiguities in tokenizing by selecting the option that results in the longest tokens.

H41020: At each step in the SQL tokenization process, SQLite shall extract the longest possible token from the remaining input text.

The tokenizer recognizes tokens one by one and passes them on to the parser, except that whitespace tokens are discarded. The only use for whitespace is as a separator between tokens.

H41030: The tokenizer shall pass each non-WHITESPACE token seen on to the parser in the order in which the tokens are seen.

The tokenizer appends a semicolon to the end of input if necessary. This ensures that every SQL statement is terminated by a semicolon.

H41040: When the tokenizer reaches the end of input where the last token sent to the parser was not a SEMI token, it shall send a SEMI token to the parser.

An unrecognized token generates an immediate error and aborts the parse.

H41050: When the tokenizer encounters text that is not a valid token, it shall cause an error to be returned to the application.
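
The five requirements above (left-to-right scanning, longest match, dropping whitespace, the implied trailing SEMI, and the error on invalid text) can be sketched as one small loop. The following Python is a minimal illustration over a deliberately tiny, ASCII-only token set. The names PATTERNS, TokenizeError, and tokenize are hypothetical, and treating empty input as needing a SEMI is an assumption about how H41040 should be read.

```python
import re

# A deliberately small token set, for illustration only (not the full grammar).
PATTERNS = [
    ("WHITESPACE", re.compile(r"[\t\n\f\r ]+")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")),
    ("SEMI", re.compile(r";")),
    ("LE", re.compile(r"<=")),
    ("LT", re.compile(r"<")),
    ("EQ", re.compile(r"==?")),
]

class TokenizeError(ValueError):
    """Raised when the remaining input does not start with a valid token."""

def tokenize(sql):
    pos = 0
    sent = []  # tokens "sent to the parser", in order
    while pos < len(sql):
        # Longest match: try every pattern at this position, keep the longest.
        best = None
        for name, pattern in PATTERNS:
            m = pattern.match(sql, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise TokenizeError(f"unrecognized token at offset {pos}")
        name, end = best
        if name != "WHITESPACE":  # whitespace only separates tokens
            sent.append((name, sql[pos:end]))
        pos = end
    # Assumption: empty input also counts as "last token was not SEMI".
    if not sent or sent[-1][0] != "SEMI":
        sent.append(("SEMI", ";"))
    return sent
```

For example, tokenize("a <= 1") yields ID, LE, INTEGER, and a synthesized SEMI; the "<=" is taken as a single LE token rather than LT followed by EQ because the longer match wins.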

Whitespace tokens

Whitespace has the usual definition.

H41100: SQLite shall recognize a sequence of one or more WHITESPACE characters as a WHITESPACE token.

An SQL comment is "--" through the end of line and is understood as whitespace.

H41110: SQLite shall recognize as a WHITESPACE token the two-character sequence "--" (u002d, u002d) followed by any sequence of non-zero characters up through and including the first u000a character or until end of input.

A C-style comment "/*...*/" is also recognized as white-space.

H41120: SQLite shall recognize as a WHITESPACE token the two-character sequence "/*" (u002f, u002a) followed by any sequence of zero or more non-zero characters up through and including the first "*/" (u002a, u002f) sequence or until end of input.
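
As a rough illustration of the three WHITESPACE forms, the following Python sketch expresses each as a regular expression and reports where a WHITESPACE token starting at a given offset would end. The pattern and function names are hypothetical, and the regular expressions are one reading of the requirements rather than SQLite's actual code.

```python
import re

# One or more WHITESPACE characters.
BLANKS = re.compile(r"[\t\n\f\r ]+")
# "--" up through and including the first newline, or to end of input.
LINE_COMMENT = re.compile(r"--[^\n\x00]*\n?")
# "/*" up through and including the first "*/", or to end of input.
BLOCK_COMMENT = re.compile(r"/\*(?:(?!\*/)[^\x00])*(?:\*/)?")

def whitespace_token_end(text, pos=0):
    """Return the end offset of a WHITESPACE token at pos, or None."""
    for pattern in (BLANKS, LINE_COMMENT, BLOCK_COMMENT):
        m = pattern.match(text, pos)
        if m:
            return m.end()
    return None
```

An unterminated "/* ..." comment runs to the end of input under this reading, so it is still a WHITESPACE token and not an error.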

Identifier tokens

Identifiers follow the usual rules with the exception that SQLite allows the dollar-sign symbol in the interior of an identifier. The dollar-sign is for compatibility with Microsoft SQL-Server and is not part of the SQL standard.

H41130: SQLite shall recognize as an ID token any sequence of characters that begins with an ALPHABETIC character and continues with zero or more ALPHANUMERIC characters and/or "$" (u0024) characters and which is not a keyword token.

Identifiers can be arbitrary character strings within square brackets. This feature is also for compatibility with Microsoft SQL-Server and not a part of the SQL standard.

H41140: SQLite shall recognize as an ID token any sequence of non-zero characters that begins with "[" (u005b) and continues through the first "]" (u005d) character.

The standard way of quoting SQL identifiers is to use double-quotes.

H41150: SQLite shall recognize as an ID token any sequence of characters that begins with a double-quote (u0022), is followed by zero or more non-zero characters and/or pairs of double-quotes (u0022) and terminates with a double-quote (u0022) that is not part of a pair.

MySQL allows identifiers to be quoted using the grave accent character. SQLite supports this for interoperability.

H41160: SQLite shall recognize as an ID token any sequence of characters that begins with a grave accent (u0060), is followed by zero or more non-zero characters and/or pairs of grave accents (u0060) and terminates with a grave accent (u0060) that is not part of a pair.
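
The four identifier forms can be approximated with regular expressions. The Python below is a sketch under stated assumptions: the names are hypothetical, the keyword exclusion required for bare identifiers is not shown, and the negative lookahead is one way to express "a terminating quote that is not part of a pair".

```python
import re

# Bare identifier: ALPHABETIC, then ALPHANUMERIC and/or "$".
BARE_ID = re.compile(r"[A-Za-z_\u0080-\U0010ffff][A-Za-z0-9_$\u0080-\U0010ffff]*")
# "[" through the first "]".
BRACKET_ID = re.compile(r"\[[^\]\x00]*\]")
# Double-quoted; a doubled quote inside stands for one quote character.
DQUOTE_ID = re.compile(r'"(?:[^"\x00]|"")*"(?!")')
# Grave-accent quoted, same doubling rule.
BACKTICK_ID = re.compile(r"`(?:[^`\x00]|``)*`(?!`)")

def match_id(text, pos=0):
    """Return the ID token text starting at pos, or None if there is none.

    Keyword exclusion for bare identifiers is intentionally not handled here.
    """
    for pattern in (BARE_ID, BRACKET_ID, DQUOTE_ID, BACKTICK_ID):
        m = pattern.match(text, pos)
        if m:
            return m.group(0)
    return None
```

Because the dollar sign is not ALPHABETIC, it may appear inside a bare identifier but cannot start one.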

Literals

This is the usual definition of string literals for SQL. SQL uses the classic Pascal string literal format.

H41200: SQLite shall recognize as a STRING token a sequence of characters that begins with a single-quote (u0027), is followed by zero or more non-zero characters and/or pairs of single-quotes (u0027) and terminates with a single-quote (u0027) that is not part of a pair.

Blob literals are similar to string literals except that they begin with a single "X" character and contain hexadecimal data.

H41210: SQLite shall recognize as a BLOB token an upper or lower-case "X" (u0058 or u0078) followed by a single-quote (u0027) followed by a number of HEXADECIMAL characters that is a multiple of two and terminated by a single-quote (u0027).
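
A minimal Python sketch of the STRING and BLOB rules follows. The names are hypothetical, and the sketch assumes that zero hexadecimal digits counts as "a multiple of two", so that an empty blob literal is accepted.

```python
import re

# Single-quoted string; a doubled single-quote inside stands for one quote.
STRING = re.compile(r"'(?:[^'\x00]|'')*'(?!')")
# X or x, a quote, an even number of hex digits (possibly none), a quote.
BLOB = re.compile(r"[xX]'(?:[0-9A-Fa-f]{2})*'")

def literal_token(text, pos=0):
    """Return (token_type, token_text) for a STRING or BLOB at pos, else None."""
    m = BLOB.match(text, pos)
    if m:
        return ("BLOB", m.group(0))
    m = STRING.match(text, pos)
    if m:
        return ("STRING", m.group(0))
    return None
```

A blob with an odd number of hexadecimal digits matches neither pattern in this sketch.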

Integer literals are a string of digits. The plus or minus sign that might optionally precede an integer is not part of the integer token.

H41220: SQLite shall recognize as an INTEGER token any sequence of one or more NUMERIC characters.

An "exponentiation suffix" is defined to be an upper or lower case "E" (u0045 or u0065) followed by one or more NUMERIC characters. The "E" and the NUMERIC characters may optionally be separated by a plus-sign (u002b) or a minus-sign (u002d). An exponentiation suffix is part of the definition of a FLOAT token:

H41230: SQLite shall recognize as a FLOAT token a sequence of one or more NUMERIC characters together with zero or one period (u002e) and followed by an exponentiation suffix.
H41240: SQLite shall recognize as a FLOAT token a sequence of one or more NUMERIC characters that includes exactly one period (u002e) character.
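
The INTEGER and FLOAT rules can be sketched with two regular expressions and a longest-match choice between them. This Python is only an illustration with hypothetical names; it reads the requirements as allowing the period at the start, middle, or end of the digits.

```python
import re

INTEGER = re.compile(r"[0-9]+")
# Digits with exactly one period (optionally followed by an exponentiation
# suffix), or digits with no period followed by an exponentiation suffix.
FLOAT = re.compile(
    r"(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"
    r"|[0-9]+[eE][+-]?[0-9]+"
)

def numeric_token(text, pos=0):
    """Return (token_type, token_text) for the longest numeric token at pos."""
    f = FLOAT.match(text, pos)
    i = INTEGER.match(text, pos)
    if f and (i is None or f.end() > i.end()):
        return ("FLOAT", f.group(0))
    if i:
        return ("INTEGER", i.group(0))
    return None
```

A leading sign is never part of the token, so "-7" does not begin with a numeric token; the tokenizer would produce MINUS followed by INTEGER.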

Variables

Variables are used as placeholders in SQL statements for constant values that are to be bound at start-time.

H40310: SQLite shall recognize as a VARIABLE token a question-mark (u003f) followed by zero or more NUMERIC characters.

A "parameter name" is defined to be a sequence of one or more characters that consists of ALPHANUMERIC characters and/or dollar-signs (u0024) intermixed with pairs of colons (u003a) and optionally followed by any sequence of non-zero, non-WHITESPACE characters enclosed in parentheses (u0028 and u0029).

H40320: SQLite shall recognize as a VARIABLE token one of the characters at-sign (u0040), dollar-sign (u0024), or colon (u003a) followed by a parameter name.
H40330: SQLite shall recognize as a VARIABLE token a sharp-sign (u0023) followed by a parameter name that does not begin with a NUMERIC character.

The REGISTER token is a special token used internally. It does not appear as part of the published user interface. Hence, the following is a low-level requirement:

L42040: SQLite shall recognize as a REGISTER token a sharp-sign (u0023) followed by one or more NUMERIC characters.
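
The VARIABLE and REGISTER rules can be sketched as follows. This Python is an interpretation rather than SQLite's implementation: the names are hypothetical, and the parenthesized suffix is assumed to end at the first closing parenthesis.

```python
import re

# Parameter name: ALPHANUMERIC and/or "$" intermixed with "::" pairs,
# optionally followed by a parenthesized run of non-zero, non-WHITESPACE
# characters (assumed here to end at the first ")").
NAME = r"(?:[A-Za-z0-9_$\u0080-\U0010ffff]|::)+(?:\([^\x00\t\n\f\r )]*\))?"

VARIABLE_PATTERNS = [
    ("VARIABLE", re.compile(r"\?[0-9]*")),          # ?  or  ?NNN
    ("VARIABLE", re.compile(r"[@$:]" + NAME)),      # @name  $name  :name
    ("VARIABLE", re.compile(r"#(?![0-9])" + NAME)), # #name, not starting with a digit
    ("REGISTER", re.compile(r"#[0-9]+")),           # #NNN (internal use)
]

def variable_token(text, pos=0):
    """Return (token_type, token_text) for a variable or register at pos."""
    for token_type, pattern in VARIABLE_PATTERNS:
        m = pattern.match(text, pos)
        if m:
            return (token_type, m.group(0))
    return None
```

The sharp-sign forms do not overlap: a digit immediately after "#" selects REGISTER, anything else that forms a parameter name selects VARIABLE.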

Operator tokens

The following sequences of special characters are recognized as tokens:

H41403: SQLite shall recognize the 1-character sequence "-" (u002d) as token MINUS
H41406: SQLite shall recognize the 1-character sequence "(" (u0028) as token LP
H41409: SQLite shall recognize the 1-character sequence ")" (u0029) as token RP
H41412: SQLite shall recognize the 1-character sequence ";" (u003b) as token SEMI
H41415: SQLite shall recognize the 1-character sequence "+" (u002b) as token PLUS
H41418: SQLite shall recognize the 1-character sequence "*" (u002a) as token STAR
H41421: SQLite shall recognize the 1-character sequence "/" (u002f) as token SLASH
H41424: SQLite shall recognize the 1-character sequence "%" (u0025) as token REM
H41427: SQLite shall recognize the 1-character sequence "=" (u003d) as token EQ
H41430: SQLite shall recognize the 2-character sequence "==" (u003d u003d) as token EQ
H41433: SQLite shall recognize the 2-character sequence "<=" (u003c u003d) as token LE
H41436: SQLite shall recognize the 2-character sequence "<>" (u003c u003e) as token NE
H41439: SQLite shall recognize the 2-character sequence "<<" (u003c u003c) as token LSHIFT
H41442: SQLite shall recognize the 1-character sequence "<" (u003c) as token LT
H41445: SQLite shall recognize the 2-character sequence ">=" (u003e u003d) as token GE
H41448: SQLite shall recognize the 2-character sequence ">>" (u003e u003e) as token RSHIFT
H41451: SQLite shall recognize the 1-character sequence ">" (u003e) as token GT
H41454: SQLite shall recognize the 2-character sequence "!=" (u0021 u003d) as token NE
H41457: SQLite shall recognize the 1-character sequence "," (u002c) as token COMMA
H41460: SQLite shall recognize the 1-character sequence "&" (u0026) as token BITAND
H41463: SQLite shall recognize the 1-character sequence "~" (u007e) as token BITNOT
H41466: SQLite shall recognize the 1-character sequence "|" (u007c) as token BITOR
H41469: SQLite shall recognize the 2-character sequence "||" (u007c u007c) as token CONCAT
H41472: SQLite shall recognize the 1-character sequence "." (u002e) as token DOT
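
The operator requirements amount to a lookup table combined with the longest-match rule. The Python sketch below uses hypothetical names and handles operators in isolation; in a full tokenizer, comments ("--", "/*") and FLOAT tokens such as ".5" would compete with MINUS, SLASH, and DOT under the same longest-match rule.

```python
# Operator text to token name, taken from the requirements above.
OPERATORS = {
    "-": "MINUS", "(": "LP", ")": "RP", ";": "SEMI", "+": "PLUS",
    "*": "STAR", "/": "SLASH", "%": "REM", "=": "EQ", "==": "EQ",
    "<=": "LE", "<>": "NE", "<<": "LSHIFT", "<": "LT",
    ">=": "GE", ">>": "RSHIFT", ">": "GT", "!=": "NE",
    ",": "COMMA", "&": "BITAND", "~": "BITNOT",
    "|": "BITOR", "||": "CONCAT", ".": "DOT",
}

def operator_token(text, pos=0):
    """Return (token_name, token_text) for the longest operator at pos."""
    for length in (2, 1):  # try the longer candidate first
        candidate = text[pos:pos + length]
        if candidate in OPERATORS:
            return (OPERATORS[candidate], candidate)
    return None
```

Note that "=" and "==" both map to EQ, and "<>" and "!=" both map to NE, while a lone "!" is not a token at all.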

Keyword tokens

The following keywords are recognized as distinct tokens:

H41503: SQLite shall recognize the 5-character sequence "ABORT" in any combination of upper and lower case letters as the keyword token ABORT
H41506: SQLite shall recognize the 3-character sequence "ADD" in any combination of upper and lower case letters as the keyword token ADD
H41509: SQLite shall recognize the 5-character sequence "AFTER" in any combination of upper and lower case letters as the keyword token AFTER
H41512: SQLite shall recognize the 3-character sequence "ALL" in any combination of upper and lower case letters as the keyword token ALL
H41515: SQLite shall recognize the 5-character sequence "ALTER" in any combination of upper and lower case letters as the keyword token ALTER
H41518: SQLite shall recognize the 7-character sequence "ANALYZE" in any combination of upper and lower case letters as the keyword token ANALYZE
H41521: SQLite shall recognize the 3-character sequence "AND" in any combination of upper and lower case letters as the keyword token AND
H41524: SQLite shall recognize the 2-character sequence "AS" in any combination of upper and lower case letters as the keyword token AS
H41527: SQLite shall recognize the 3-character sequence "ASC" in any combination of upper and lower case letters as the keyword token ASC
H41530: SQLite shall recognize the 6-character sequence "ATTACH" in any combination of upper and lower case letters as the keyword token ATTACH
H41533: SQLite shall recognize the 13-character sequence "AUTOINCREMENT" in any combination of upper and lower case letters as the keyword token AUTOINCR
H41536: SQLite shall recognize the 6-character sequence "BEFORE" in any combination of upper and lower case letters as the keyword token BEFORE
H41539: SQLite shall recognize the 5-character sequence "BEGIN" in any combination of upper and lower case letters as the keyword token BEGIN
H41542: SQLite shall recognize the 7-character sequence "BETWEEN" in any combination of upper and lower case letters as the keyword token BETWEEN
H41545: SQLite shall recognize the 2-character sequence "BY" in any combination of upper and lower case letters as the keyword token BY
H41548: SQLite shall recognize the 7-character sequence "CASCADE" in any combination of upper and lower case letters as the keyword token CASCADE
H41551: SQLite shall recognize the 4-character sequence "CASE" in any combination of upper and lower case letters as the keyword token CASE
H41554: SQLite shall recognize the 4-character sequence "CAST" in any combination of upper and lower case letters as the keyword token CAST
H41557: SQLite shall recognize the 5-character sequence "CHECK" in any combination of upper and lower case letters as the keyword token CHECK
H41560: SQLite shall recognize the 7-character sequence "COLLATE" in any combination of upper and lower case letters as the keyword token COLLATE
H41563: SQLite shall recognize the 6-character sequence "COLUMN" in any combination of upper and lower case letters as the keyword token COLUMNKW
H41566: SQLite shall recognize the 6-character sequence "COMMIT" in any combination of upper and lower case letters as the keyword token COMMIT
H41569: SQLite shall recognize the 8-character sequence "CONFLICT" in any combination of upper and lower case letters as the keyword token CONFLICT
H41572: SQLite shall recognize the 10-character sequence "CONSTRAINT" in any combination of upper and lower case letters as the keyword token CONSTRAINT
H41575: SQLite shall recognize the 6-character sequence "CREATE" in any combination of upper and lower case letters as the keyword token CREATE
H41578: SQLite shall recognize the 5-character sequence "CROSS" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41581: SQLite shall recognize the 12-character sequence "CURRENT_DATE" in any combination of upper and lower case letters as the keyword token CTIME_KW
H41584: SQLite shall recognize the 12-character sequence "CURRENT_TIME" in any combination of upper and lower case letters as the keyword token CTIME_KW
H41587: SQLite shall recognize the 17-character sequence "CURRENT_TIMESTAMP" in any combination of upper and lower case letters as the keyword token CTIME_KW
H41590: SQLite shall recognize the 8-character sequence "DATABASE" in any combination of upper and lower case letters as the keyword token DATABASE
H41593: SQLite shall recognize the 7-character sequence "DEFAULT" in any combination of upper and lower case letters as the keyword token DEFAULT
H41596: SQLite shall recognize the 8-character sequence "DEFERRED" in any combination of upper and lower case letters as the keyword token DEFERRED
H41599: SQLite shall recognize the 10-character sequence "DEFERRABLE" in any combination of upper and lower case letters as the keyword token DEFERRABLE
H41602: SQLite shall recognize the 6-character sequence "DELETE" in any combination of upper and lower case letters as the keyword token DELETE
H41605: SQLite shall recognize the 4-character sequence "DESC" in any combination of upper and lower case letters as the keyword token DESC
H41608: SQLite shall recognize the 6-character sequence "DETACH" in any combination of upper and lower case letters as the keyword token DETACH
H41611: SQLite shall recognize the 8-character sequence "DISTINCT" in any combination of upper and lower case letters as the keyword token DISTINCT
H41614: SQLite shall recognize the 4-character sequence "DROP" in any combination of upper and lower case letters as the keyword token DROP
H41617: SQLite shall recognize the 3-character sequence "END" in any combination of upper and lower case letters as the keyword token END
H41620: SQLite shall recognize the 4-character sequence "EACH" in any combination of upper and lower case letters as the keyword token EACH
H41623: SQLite shall recognize the 4-character sequence "ELSE" in any combination of upper and lower case letters as the keyword token ELSE
H41626: SQLite shall recognize the 6-character sequence "ESCAPE" in any combination of upper and lower case letters as the keyword token ESCAPE
H41629: SQLite shall recognize the 6-character sequence "EXCEPT" in any combination of upper and lower case letters as the keyword token EXCEPT
H41632: SQLite shall recognize the 9-character sequence "EXCLUSIVE" in any combination of upper and lower case letters as the keyword token EXCLUSIVE
H41635: SQLite shall recognize the 6-character sequence "EXISTS" in any combination of upper and lower case letters as the keyword token EXISTS
H41638: SQLite shall recognize the 7-character sequence "EXPLAIN" in any combination of upper and lower case letters as the keyword token EXPLAIN
H41641: SQLite shall recognize the 4-character sequence "FAIL" in any combination of upper and lower case letters as the keyword token FAIL
H41644: SQLite shall recognize the 3-character sequence "FOR" in any combination of upper and lower case letters as the keyword token FOR
H41647: SQLite shall recognize the 7-character sequence "FOREIGN" in any combination of upper and lower case letters as the keyword token FOREIGN
H41650: SQLite shall recognize the 4-character sequence "FROM" in any combination of upper and lower case letters as the keyword token FROM
H41653: SQLite shall recognize the 4-character sequence "FULL" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41656: SQLite shall recognize the 4-character sequence "GLOB" in any combination of upper and lower case letters as the keyword token LIKE_KW
H41659: SQLite shall recognize the 5-character sequence "GROUP" in any combination of upper and lower case letters as the keyword token GROUP
H41662: SQLite shall recognize the 6-character sequence "HAVING" in any combination of upper and lower case letters as the keyword token HAVING
H41665: SQLite shall recognize the 2-character sequence "IF" in any combination of upper and lower case letters as the keyword token IF
H41668: SQLite shall recognize the 6-character sequence "IGNORE" in any combination of upper and lower case letters as the keyword token IGNORE
H41671: SQLite shall recognize the 9-character sequence "IMMEDIATE" in any combination of upper and lower case letters as the keyword token IMMEDIATE
H41674: SQLite shall recognize the 2-character sequence "IN" in any combination of upper and lower case letters as the keyword token IN
H41677: SQLite shall recognize the 5-character sequence "INDEX" in any combination of upper and lower case letters as the keyword token INDEX
H41680: SQLite shall recognize the 9-character sequence "INITIALLY" in any combination of upper and lower case letters as the keyword token INITIALLY
H41683: SQLite shall recognize the 5-character sequence "INNER" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41686: SQLite shall recognize the 6-character sequence "INSERT" in any combination of upper and lower case letters as the keyword token INSERT
H41689: SQLite shall recognize the 7-character sequence "INSTEAD" in any combination of upper and lower case letters as the keyword token INSTEAD
H41692: SQLite shall recognize the 9-character sequence "INTERSECT" in any combination of upper and lower case letters as the keyword token INTERSECT
H41695: SQLite shall recognize the 4-character sequence "INTO" in any combination of upper and lower case letters as the keyword token INTO
H41698: SQLite shall recognize the 2-character sequence "IS" in any combination of upper and lower case letters as the keyword token IS
H41701: SQLite shall recognize the 6-character sequence "ISNULL" in any combination of upper and lower case letters as the keyword token ISNULL
H41704: SQLite shall recognize the 4-character sequence "JOIN" in any combination of upper and lower case letters as the keyword token JOIN
H41707: SQLite shall recognize the 3-character sequence "KEY" in any combination of upper and lower case letters as the keyword token KEY
H41710: SQLite shall recognize the 4-character sequence "LEFT" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41713: SQLite shall recognize the 4-character sequence "LIKE" in any combination of upper and lower case letters as the keyword token LIKE_KW
H41716: SQLite shall recognize the 5-character sequence "LIMIT" in any combination of upper and lower case letters as the keyword token LIMIT
H41719: SQLite shall recognize the 5-character sequence "MATCH" in any combination of upper and lower case letters as the keyword token MATCH
H41722: SQLite shall recognize the 7-character sequence "NATURAL" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41725: SQLite shall recognize the 3-character sequence "NOT" in any combination of upper and lower case letters as the keyword token NOT
H41728: SQLite shall recognize the 7-character sequence "NOTNULL" in any combination of upper and lower case letters as the keyword token NOTNULL
H41731: SQLite shall recognize the 4-character sequence "NULL" in any combination of upper and lower case letters as the keyword token NULL
H41734: SQLite shall recognize the 2-character sequence "OF" in any combination of upper and lower case letters as the keyword token OF
H41737: SQLite shall recognize the 6-character sequence "OFFSET" in any combination of upper and lower case letters as the keyword token OFFSET
H41740: SQLite shall recognize the 2-character sequence "ON" in any combination of upper and lower case letters as the keyword token ON
H41743: SQLite shall recognize the 2-character sequence "OR" in any combination of upper and lower case letters as the keyword token OR
H41746: SQLite shall recognize the 5-character sequence "ORDER" in any combination of upper and lower case letters as the keyword token ORDER
H41749: SQLite shall recognize the 5-character sequence "OUTER" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41752: SQLite shall recognize the 4-character sequence "PLAN" in any combination of upper and lower case letters as the keyword token PLAN
H41755: SQLite shall recognize the 6-character sequence "PRAGMA" in any combination of upper and lower case letters as the keyword token PRAGMA
H41758: SQLite shall recognize the 7-character sequence "PRIMARY" in any combination of upper and lower case letters as the keyword token PRIMARY
H41761: SQLite shall recognize the 5-character sequence "QUERY" in any combination of upper and lower case letters as the keyword token QUERY
H41764: SQLite shall recognize the 5-character sequence "RAISE" in any combination of upper and lower case letters as the keyword token RAISE
H41767: SQLite shall recognize the 10-character sequence "REFERENCES" in any combination of upper and lower case letters as the keyword token REFERENCES
H41770: SQLite shall recognize the 6-character sequence "REGEXP" in any combination of upper and lower case letters as the keyword token LIKE_KW
H41773: SQLite shall recognize the 7-character sequence "REINDEX" in any combination of upper and lower case letters as the keyword token REINDEX
H41776: SQLite shall recognize the 6-character sequence "RENAME" in any combination of upper and lower case letters as the keyword token RENAME
H41779: SQLite shall recognize the 7-character sequence "REPLACE" in any combination of upper and lower case letters as the keyword token REPLACE
H41782: SQLite shall recognize the 8-character sequence "RESTRICT" in any combination of upper and lower case letters as the keyword token RESTRICT
H41785: SQLite shall recognize the 5-character sequence "RIGHT" in any combination of upper and lower case letters as the keyword token JOIN_KW
H41788: SQLite shall recognize the 8-character sequence "ROLLBACK" in any combination of upper and lower case letters as the keyword token ROLLBACK
H41791: SQLite shall recognize the 3-character sequence "ROW" in any combination of upper and lower case letters as the keyword token ROW
H41794: SQLite shall recognize the 6-character sequence "SELECT" in any combination of upper and lower case letters as the keyword token SELECT
H41797: SQLite shall recognize the 3-character sequence "SET" in any combination of upper and lower case letters as the keyword token SET
H41800: SQLite shall recognize the 5-character sequence "TABLE" in any combination of upper and lower case letters as the keyword token TABLE
H41803: SQLite shall recognize the 4-character sequence "TEMP" in any combination of upper and lower case letters as the keyword token TEMP
H41806: SQLite shall recognize the 9-character sequence "TEMPORARY" in any combination of upper and lower case letters as the keyword token TEMP
H41809: SQLite shall recognize the 4-character sequence "THEN" in any combination of upper and lower case letters as the keyword token THEN
H41812: SQLite shall recognize the 2-character sequence "TO" in any combination of upper and lower case letters as the keyword token TO
H41815: SQLite shall recognize the 11-character sequence "TRANSACTION" in any combination of upper and lower case letters as the keyword token TRANSACTION
H41818: SQLite shall recognize the 7-character sequence "TRIGGER" in any combination of upper and lower case letters as the keyword token TRIGGER
H41821: SQLite shall recognize the 5-character sequence "UNION" in any combination of upper and lower case letters as the keyword token UNION
H41824: SQLite shall recognize the 6-character sequence "UNIQUE" in any combination of upper and lower case letters as the keyword token UNIQUE
H41827: SQLite shall recognize the 6-character sequence "UPDATE" in any combination of upper and lower case letters as the keyword token UPDATE
H41830: SQLite shall recognize the 5-character sequence "USING" in any combination of upper and lower case letters as the keyword token USING
H41833: SQLite shall recognize the 6-character sequence "VACUUM" in any combination of upper and lower case letters as the keyword token VACUUM
H41836: SQLite shall recognize the 6-character sequence "VALUES" in any combination of upper and lower case letters as the keyword token VALUES
H41839: SQLite shall recognize the 4-character sequence "VIEW" in any combination of upper and lower case letters as the keyword token VIEW
H41842: SQLite shall recognize the 7-character sequence "VIRTUAL" in any combination of upper and lower case letters as the keyword token VIRTUAL
H41845: SQLite shall recognize the 4-character sequence "WHEN" in any combination of upper and lower case letters as the keyword token WHEN
H41848: SQLite shall recognize the 5-character sequence "WHERE" in any combination of upper and lower case letters as the keyword token WHERE
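
Keyword recognition can be sketched as a case-insensitive table lookup applied to a bare word. The Python below is an illustration with hypothetical names: it lists every keyword whose token name differs from its spelling, but only a small sample of the keywords that map to a token of the same name. Treating words with non-ASCII characters as ordinary identifiers is an assumption made so that Unicode case folding cannot turn them into keywords.

```python
# Keywords whose token name differs from the keyword text.
RENAMED = {
    "AUTOINCREMENT": "AUTOINCR", "COLUMN": "COLUMNKW", "TEMPORARY": "TEMP",
    "CROSS": "JOIN_KW", "FULL": "JOIN_KW", "INNER": "JOIN_KW",
    "LEFT": "JOIN_KW", "NATURAL": "JOIN_KW", "OUTER": "JOIN_KW",
    "RIGHT": "JOIN_KW",
    "CURRENT_DATE": "CTIME_KW", "CURRENT_TIME": "CTIME_KW",
    "CURRENT_TIMESTAMP": "CTIME_KW",
    "GLOB": "LIKE_KW", "LIKE": "LIKE_KW", "REGEXP": "LIKE_KW",
}

# A small sample of keywords whose token name equals the keyword text.
SAME_NAME = {"SELECT", "FROM", "WHERE", "JOIN", "TEMP", "NOT", "NULL"}

def keyword_token(word):
    """Return the keyword token name for a bare word, or "ID" if none."""
    if not word.isascii():  # assumption: keywords are ASCII-only
        return "ID"
    upper = word.upper()
    if upper in RENAMED:
        return RENAMED[upper]
    if upper in SAME_NAME:
        return upper
    return "ID"
```

Several distinct keywords share one token name (for example, all the join modifiers map to JOIN_KW), so the parser can only tell them apart by the token text.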

This page last modified 2008/08/07 20:09:02 UTC