2.2 Reference

This section serves as a reference for users of the Gump Scanner Generator. Section 2.2.1 details the syntax of the embedded scanner specification language, Section 2.2.2 describes which options are supported and how they are specified, and Section 2.2.3 covers the runtime part of the Scanner Generator, the mixin class GumpScanner.'class'.

2.2.1 Syntax of the Scanner Specification Language

The notation used here for specifying the syntax of the specification language is a variant of BNF and is defined in Appendix A.

A scanner specification is allowed anywhere as an Oz statement:

<statement> += <scanner specification>

It is similar to a class definition, with three differences: it is introduced by the keyword scanner; it must be named by a variable (and not an arbitrary term), since the name is used for assigning file names; and it allows additional descriptors after the usual class descriptors.

<scanner specification> ::= scanner <variable>
    { <class descriptor> }
    { <method> }
    { <scanner descriptor> }+
    end

A lexical abbreviation associates an identifier with a regular expression, which can then be referenced in subsequent lexical abbreviations or any lexical rules by enclosing the identifier in curly brackets. The regular expression is additionally parenthesized when it is expanded.

<lexical abbreviation> ::= lex <atom> "=" <regex> end
 | lex <variable> "=" <regex> end
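
The expansion of lexical abbreviations described above can be modeled in a few lines of Python. This is only a sketch of the substitution step, not the generator's implementation: the function names expand and define are hypothetical, the name pattern follows flex's rule for definition names (a letter or underscore followed by letters, digits, underscores or hyphens), and references inside quoted strings or character classes are not treated specially.

```python
import re

# Flex-style definition names: a letter or underscore, then letters,
# digits, underscores or hyphens. Repeat counts such as {2,3} do not match.
NAME_REF = re.compile(r"\{([A-Za-z_][A-Za-z0-9_-]*)\}")


def expand(regex, abbreviations):
    """Replace each {name} reference by its parenthesized definition."""
    def substitute(match):
        name = match.group(1)
        if name not in abbreviations:
            raise KeyError(f"undefined lexical abbreviation: {name}")
        return "(" + abbreviations[name] + ")"
    return NAME_REF.sub(substitute, regex)


def define(abbreviations, name, regex):
    """Record an abbreviation, expanding references to earlier ones first."""
    abbreviations[name] = expand(regex, abbreviations)


abbreviations = {}
define(abbreviations, "digit", "[0-9]")
define(abbreviations, "int", "{digit}+")
define(abbreviations, "sign", "a|b")

# A rule head that refers to the abbreviations defined above.
real = expand(r"{int}\.{int}", abbreviations)
# Without the added parentheses this would read a|bc instead of (a|b)c.
grouped = expand("{sign}c", abbreviations)
```

The parentheses matter whenever a definition contains an alternation or is followed by a postfix operator: they keep the expanded text a single unit inside the referencing expression.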

The definition of a lexical rule is similar to the definition of a method. However, its head consists of a regular expression; when this is matched, the body of the lexical rule is executed (as a method).

<lexical rule> ::= lex <regex>
    <in statement>
    end

Regular expressions may be annotated with lexical modes. Each lexical mode constitutes an independent sub-scanner: at any time exactly one mode is active, and only the regular expressions annotated with that mode are matched. All lexical rules defined within the scope of a lexical mode are annotated with this lexical mode. A lexical mode may inherit from other lexical modes; all regular expressions in these modes are then annotated with the inheriting lexical mode as well. Lexical modes implicitly inherit from all lexical modes they are nested in. Lexical rules written at top level are annotated with the implicitly declared mode INITIAL.

<lexical mode> ::= mode <variable> [ from { <variable> }+ ]
    { <mode descriptor> }
    end

<mode descriptor> ::= <lexical rule>
 | <lexical mode>
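
How rules end up annotated with modes can be illustrated with a small Python model. It is a sketch under stated assumptions: the mode and rule names are made up, inheritance is read as transitive (an inherited rule counts as a rule of the inheriting mode), and top-level modes are assumed not to inherit from INITIAL, since the text only mentions inheritance from modes named after from and from enclosing modes.

```python
def reachable(modes):
    """For each mode, collect the modes whose rules it can match."""
    def collect(name, seen):
        if name not in seen:
            seen.add(name)
            enclosing, inherited = modes[name]
            for other in ([enclosing] if enclosing else []) + list(inherited):
                collect(other, seen)
        return seen
    return {name: collect(name, set()) for name in modes}


def annotate(rules, modes):
    """Map each rule to the set of modes it ends up annotated with."""
    reach = reachable(modes)
    return {rule: {mode for mode in modes if home in reach[mode]}
            for rule, home in rules.items()}


# Hypothetical specification: mode name -> (enclosing mode, modes after 'from').
modes = {
    "INITIAL": (None, []),
    "Text": (None, []),
    "Quoted": (None, ["Text"]),   # mode Quoted from Text ... end
    "Escape": ("Quoted", []),     # mode Escape nested inside Quoted
}
# Rule name -> mode in whose scope the rule is written.
rules = {
    "identifier": "INITIAL",
    "newline": "Text",
    "quote": "Quoted",
    "backslash": "Escape",
}
annotations = annotate(rules, modes)
```

In this model the rule written in Text is active in Text, Quoted and Escape, while the rule written in Escape is active only there.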

Syntax of Regular Expressions

Regular expressions <regex> correspond to the regular expressions used in flex [Pax95] with a few exceptions:

Due to the underlying use of flex, the names of lexical abbreviations are restricted to the syntax allowed in flex name definitions.

Ambiguities and Errors in the Rule Set

Tokenization is performed by a left-to-right scan of the input character stream. If several rules match a prefix of the input, then the rule matching the longest prefix is preferred. If several rules match the same (longest) prefix of the input, then one of two disambiguation strategies is applied (see Section 2.2.2 on how to select the strategy):

First-fit.

The rule written first in the scanner specification is preferred. In this case, every conflict can be uniquely resolved. Two kinds of errors in the rule set are possible: holes and completely covered rules (see below).

Best-fit.

Suppose two conflicting rules are rule r1 and rule r2, which are annotated by sets of lexical modes S1 and S2 respectively. Then r1 is preferred over r2 if and only if the following condition holds:

S1 is a subset of S2 and L(r1) is a subset of L(r2)

where L(r) is the language generated by a regular expression r, that is, the set of strings that match r. Intuitively, this condition means that r1 is "more specialized than" r2. In addition to the errors possible in the first-fit case, the rule set may fail to be well-ordered with respect to the "more specialized than" relation.
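
The longest-match rule combined with first-fit disambiguation can be sketched as follows. This is an illustrative Python model, not the flex-based implementation: the function tokenize and the sample rules are hypothetical, lexical modes are ignored, and Python's re module prefers the leftmost alternative inside a single pattern rather than the longest one, so the sketch is only faithful for patterns where that difference does not arise.

```python
import re


def tokenize(text, rules):
    """Split text using (name, regex) pairs given in specification order."""
    compiled = [(name, re.compile(regex)) for name, regex in rules]
    position, tokens = 0, []
    while position < len(text):
        best_name, best_length = None, 0
        for name, pattern in compiled:
            match = pattern.match(text, position)
            # A strictly longer match wins; on a tie the rule written
            # earlier is kept, which is the first-fit strategy.
            if match and len(match.group()) > best_length:
                best_name, best_length = name, len(match.group())
        if best_name is None:
            # A hole in the rule set: no rule matches a non-empty prefix.
            raise ValueError(f"no rule matches at offset {position}")
        tokens.append((best_name, text[position:position + best_length]))
        position += best_length
    return tokens


rules = [
    ("keyword", r"if"),
    ("identifier", r"[a-z]+"),
    ("space", r" +"),
]
tokens = tokenize("if ifx", rules)
```

Here "if" is matched by both the keyword and the identifier rule with the same length, so the keyword rule wins because it is written first; "ifx" goes to the identifier rule because its match is longer.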

The following errors in the rule set may occur:

Holes in the rule set.

For some input (in some mode), no true prefix is matched by any rule. Due to the underlying implementation using flex, this will result in the warning message

"-s option given but default rule can be matched"

If such input is encountered at run time, an error exception is raised:

"flex scanner jammed"

Completely covered rules.

A rule r is never matched because for every prefix in L(r) there exists another rule r' which is preferred over r.

Non well-orderedness.

Two rules r1 and r2 are in conflict in the best-fit case, but neither is more specialized than the other, and no rule or set of rules exists that covers L(r1) intersected with L(r2).
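
The best-fit preference condition and the resulting ordering problem can be made concrete with a small Python model. It rests on loud simplifications: each language L(r) is replaced by a finite set of sample strings, the names Rule, make_rule, more_specialized, in_conflict and preferred are hypothetical, and the clause about other rules covering the intersection is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    modes: frozenset      # S: lexical modes the rule is annotated with
    language: frozenset   # finite stand-in for L(r)


def make_rule(name, modes, language):
    return Rule(name, frozenset(modes), frozenset(language))


def more_specialized(r1, r2):
    """S1 is a subset of S2 and L(r1) is a subset of L(r2)."""
    return r1.modes <= r2.modes and r1.language <= r2.language


def in_conflict(r1, r2):
    """Both rules can match the same string in a common mode."""
    return bool(r1.modes & r2.modes) and bool(r1.language & r2.language)


def preferred(r1, r2):
    """Name of the preferred rule, or None if the pair is unordered.

    Rules with identical modes and languages satisfy the condition in both
    directions; this sketch then simply reports the first argument.
    """
    if more_specialized(r1, r2):
        return r1.name
    if more_specialized(r2, r1):
        return r2.name
    return None


keyword = make_rule("keyword", {"INITIAL"}, {"if", "then"})
identifier = make_rule("identifier", {"INITIAL"}, {"if", "then", "x", "foo"})
left = make_rule("left", {"INITIAL"}, {"ab", "abc"})
right = make_rule("right", {"INITIAL"}, {"ab", "abd"})

winner = preferred(identifier, keyword)   # independent of argument order
unordered = preferred(left, right)        # overlap, but neither is a subset
```

Unlike first-fit, the outcome for keyword and identifier does not depend on the order in which the rules are written. The pair left and right is in conflict on "ab" but is not ordered by the relation, which is the situation reported as non well-orderedness unless other rules cover the overlap.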

2.2.2 Parameters to Scanner Generation

The Gump Scanner Generator supports several configuration parameters, which may be set on a per-scanner basis via the use of macro directives.

Macro Directives

Due to the implementation of scanners in C++, a unique prefix is required for each scanner to avoid symbol conflicts when several scanners reside at the same time in the Mozart system. The following macro directive allows this prefix to be changed (the default zy is all right if only a single scanner is used at any time):

\gumpscannerprefix <atom>

Switches

Table 2.1 summarizes some compiler switches that control the Gump Scanner Generator.


Switch                          Effect
\switch +gumpscannerbestfit     Use best-fit instead of first-fit disambiguation
\switch +gumpscannercaseless    Generate a case-insensitive scanner
\switch +gumpscannernowarn      Suppress warnings from flex

Table 2.1: Compiler switches for the Gump Scanner Generator.


2.2.3 The Mixin Class GumpScanner.'class'

The module GumpScanner defines the runtime support needed by Gump-generated scanners. All operations and data are encapsulated in the mixin class GumpScanner.'class' that scanners have to inherit from in order to be executable.

Abstract Members

The mixin class expects the following features and methods to be defined by derived classes. (It is a good idea not to define any class members whose names begin with lex... since these may be used for internals of the Scanner Generator.)

feat lexer

This feature must contain the scanner-specific loaded foreign functions, which include the generated scanner tables.

meth lexExecuteAction(+I)

This method is called each time a regular expression is matched. Regular expressions are assigned unique integers; I indicates which rule's associated action is to be run.

Provided Members

The GumpScanner.'class' class defines some user functionality that is to be used either by users of the generated scanner or by the semantic actions in the scanner itself.

meth init()

This initializes the internal structures of the GumpScanner.'class'. This must be called before any other method of this class.

meth setMode(+I)

The operation mode of the scanner is set to the lexical mode I. Lexical modes are represented internally as integers. Since modes are identified by variables, the class generation phase wraps a local ... end around the class equating the mode variables to the assigned unique integers.

meth currentMode(?I)

This returns the integer I identifying the lexical mode the scanner currently operates in.

meth getAtom(?A)

This method is used to access the lexeme last matched. It is returned as an atom in the variable A. Note that if the lexeme contains a NUL character (ISO 0) then only the text up to the first NUL but excluding it is returned.

meth getString(?S)

This method returns the lexeme as a string in the variable S. The restrictions concerning getAtom do not apply to getString.

meth getLength(?I)

This method returns the length of the lexeme (number of characters matched).

meth putToken(+X Y)

This method may be used to append a token with token class X and value Y to the token stream. (Actually, the token class may be an arbitrary Oz value, but atoms and the integers between 0 and 255 are the only representations understood by Gump-generated parsers.)

meth putToken1(+X)

This method may be used to append a token with token class X and value unit to the token stream.

meth getToken(?X Y)

The next token is removed from the token stream and returned. The token class is returned in X and its value in Y.

meth input(?C)

The next (unmatched) character is removed from the character stream and returned in C.

meth scanFile(+V)

This method causes the currently scanned buffer (if any) to be pushed on a stack of active buffers. A new buffer is created from the file with name V and scanned. If the file does not exist, the error exception gump(fileNotFound V) with the filename in V is raised; the default treatment is the invocation of a custom error printer.

meth scanVirtualString(+V)

Like scanFile, but scans a virtual string V. If V contains NUL characters (ISO 0) then the virtual string is only scanned up to and excluding the first NUL character.

meth setInteractive(+B)

Each buffer may be either interactive or non-interactive. An interactive buffer reads only as many characters as are needed to decide on a match; a non-interactive buffer may read ahead. This method allows the topmost buffer on the stack to be set to interactive (if B is true) or non-interactive (if B is false). New buffers are always created as non-interactive buffers.

meth getInteractive(?B)

Returns in B whether the topmost buffer on the buffer stack is interactive.

meth setBOL(+B)

The beginning-of-line (BOL) flag indicates whether the beginning-of-line regular expression lex <^> will currently match the input. This flag is true at the beginning of a buffer or after a newline has been scanned. The flag's value may be set at will with this method.

meth getBOL(?B)

Returns the current state of the beginning-of-line flag. See the setBOL method.

meth closeBuffer()

Closes the topmost buffer on the buffer stack and resumes scanning from the buffer on the new stack top (if any). If the buffer stack is or becomes empty through this operation, only tokens with class 'EOF' and value unit are returned subsequently (until a new buffer is created).

meth close()

Closes all buffers on the buffer stack. Before calling any other methods, you should call init() again.
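
The buffer stack behind scanVirtualString, closeBuffer and close can be pictured with the following toy model in Python. It is not the GumpScanner implementation: the class BufferStack and its method names are hypothetical, and returning None when nothing is left to read, as well as treating close_buffer on an empty stack as a no-op, are assumptions of the sketch that the text above does not specify.

```python
class BufferStack:
    """Toy model of the stack of active buffers."""

    def __init__(self):
        self.stack = []  # entries are [text, read position]; last is topmost

    def scan_virtual_string(self, text):
        # Only the part before the first NUL character is scanned.
        self.stack.append([text.split("\0", 1)[0], 0])

    def input(self):
        # Assumption of this sketch: None signals "nothing left to read".
        if not self.stack:
            return None
        top = self.stack[-1]
        text, position = top
        if position >= len(text):
            return None
        top[1] = position + 1
        return text[position]

    def close_buffer(self):
        # Assumption: closing with an empty stack does nothing.
        if self.stack:
            self.stack.pop()

    def close(self):
        self.stack.clear()

    def is_empty(self):
        # With an empty stack only 'EOF' tokens would be returned.
        return not self.stack


buffers = BufferStack()
buffers.scan_virtual_string("ab")
first = buffers.input()
# A nested buffer, as would be opened for an included file.
buffers.scan_virtual_string("XY\0ignored")
inner = [buffers.input(), buffers.input(), buffers.input()]
buffers.close_buffer()
resumed = buffers.input()   # continues in the outer buffer
```

The outer buffer keeps its read position while the nested buffer is on top, so scanning resumes exactly where it left off once the nested buffer is closed.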


Leif Kornstaedt
Version 1.4.0 (20080702)