2.1 Example

As a running example we will specify, throughout the manual, the front-end of a compiler or interpreter for a small functional language called Lambda. In this section we define the scanner for this language; in Section 3.1 we build a parser on top of it.

2.1.1 Writing a Scanner Specification

Program 2.1 shows the specification of the sample scanner we will consider in this section. In the following we will examine this example line by line.

Class Descriptors

At first glance the scanner specification closely resembles a class definition with some extra elements, introduced by the keyword scanner instead of class. This is intentional, since the specification will ultimately be translated into a class. For this reason, all descriptors allowed in a class definition are also allowed at the beginning of a scanner specification; consider the from, attr and meth constructs used in lines 2 to 10.

Lexical Abbreviations

The scanner-specific declarations begin at line 12. Two kinds of definition can be introduced by the keyword lex: either a lexical abbreviation, as seen in lines 12 to 15, or a lexical rule, as found from line 17 to the end of the specification. A lexical abbreviation lex I = <R> end associates an identifier I with a given regular expression R. Occurrences of {I} in other regular expressions are then replaced by (R).
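For example, using the digit and letter abbreviations defined below in Program 2.1, the identifier abbreviation

lex digit = <[0-9]> end
lex letter = <[A-Za-z]> end
lex id = <{letter}({letter}|{digit})*> end

makes {id} stand for the regular expression ([A-Za-z])(([A-Za-z])|([0-9]))* after substitution. (This is a sketch of the textual expansion; the exact grouping Gump produces may differ slightly.)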

Note that regular expressions use the same syntax as in flex [Pax95], with a few exceptions (detailed in Section 2.2.1). Furthermore, they must either be enclosed in angle brackets or given as Oz strings. (The latter proves useful when the angle-bracketed version confuses Emacs' fontification mode, but is a bit harder to read, since more characters must be escaped.)
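For instance, the whitespace rule from Program 2.1 could also be written with its regular expression given as an Oz string (a sketch; we assume the backslash must be doubled in the string so that flex still sees \t):

lex <[ \t]> skip end
lex "[ \\t]" skip end   % equivalent string form (sketch)

Both rules are intended to denote the same flex expression, matching a single space or tab character.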

The example defines four lexical abbreviations: digit stands for a decimal digit and letter for an uppercase or lowercase letter; id defines the syntax of identifiers, consisting of a letter followed by an arbitrary sequence of letters and digits; finally, int defines the syntax of positive decimal integers as a nonempty sequence of digits.

Lexical Rules

Lexical rules of the form lex <R> S end are more interesting, since the set of these constitutes the actual scanner specification. When a prefix of the input character stream matches the regular expression R, the statement S is executed as a method body (i.e., self may be accessed and modified). The mixin class GumpScanner.'class' (inherited from in line 2) provides two methods to append tokens to the token stream: putToken1, which appends a token of a given class without a value (unit is used instead), and putToken, which allows a specific token value to be provided. Token classes may be represented by arbitrary Oz values, but the parser generator in Chapter 3 expects them to be atoms. In lines 18 and 21 you can see how constants are used as token classes; in line 33 the token class is computed from the lexeme.
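For instance, these two rules from Program 2.1 show a constant token class and a token class computed from the lexeme, respectively:

lex <define>
   GumpScanner.'class', putToken1('define')
end
lex <"."|"("|")"|"="|";"> A in
   GumpScanner.'class', getAtom(?A)
   GumpScanner.'class', putToken1(A)
end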

Accessing the Lexeme

The lexeme itself may be accessed in several ways. The method getAtom returns the lexeme as an atom, which is the representation chosen for identifier token values in line 25. The method getString returns the lexeme as a string, as in line 28, where it is subsequently converted to an integer.
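These are the two corresponding rules from Program 2.1, retrieving the lexeme as an atom and as a string, respectively:

lex <{id}> A in
   GumpScanner.'class', getAtom(?A)
   GumpScanner.'class', putToken('id' A)
end
lex <{int}> S in
   GumpScanner.'class', getString(?S)
   GumpScanner.'class', putToken('int' {String.toInt S})
end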

The remaining lexical rules are easily explained. Lines 36 and 37 describe how whitespace and comments, respectively, are to be ignored; this is done by calling neither putToken1 nor putToken. The rule in line 38 ignores any matched newlines, but updates the line counter attribute LineNumber as it does so. The rule in line 41 reports any remaining unmatched characters in the input as lexical errors and appends the token 'error', which the parser can recognize as an erroneous token. Note that an action may also invoke putToken1 or putToken several times to append multiple tokens to the token stream, just as it may choose not to invoke them at all, either to simply ignore the lexeme or only to produce side effects; a hypothetical multi-token action is sketched below.
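As a purely hypothetical illustration (this rule is not part of Program 2.1), an action could append two tokens for a single lexeme:

% hypothetical rule: one lexeme ";;" yields two separate ';' tokens
lex <";;">
   GumpScanner.'class', putToken1(';')
   GumpScanner.'class', putToken1(';')
end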

End-of-File Rules

The final rule, in line 46, has the special syntax <<EOF>> (it might also have been written as "<<EOF>>") and only matches the end of the character stream. It appends the token 'EOF', which the parser can recognize as the end of input. Note that the action might just as well open another file to read from.

More information about acceptable sets of regular expressions in scanner specifications, conflict resolution and grouping into lexical modes is given in Section 2.2.1.


declare 
scanner LambdaScanner from GumpScanner.'class' 
   attr LineNumber
   meth init()
      GumpScanner.'class', init()
      LineNumber <- 1
   end 
   meth getLineNumber($)
      @LineNumber
   end 
 
   lex digit = <[0-9]> end 
   lex letter = <[A-Za-z]> end 
   lex id = <{letter}({letter}|{digit})*> end 
   lex int = <{digit}+> end 
 
   lex <define> 
      GumpScanner.'class', putToken1('define')
   end 
   lex <lambda> 
      GumpScanner.'class', putToken1('lambda')
   end 
   lex <{id}> A in 
      GumpScanner.'class', getAtom(?A)
      GumpScanner.'class', putToken('id' A)
   end 
   lex <{int}> S in 
      GumpScanner.'class', getString(?S)
      GumpScanner.'class', putToken('int' {String.toInt S})
   end 
   lex <"."|"("|")"|"="|";"> A in 
      GumpScanner.'class', getAtom(?A)
      GumpScanner.'class', putToken1(A)
   end 
 
   lex <[ \t]> skip end 
   lex <"%".*> skip end 
   lex <\n> 
      LineNumber <- @LineNumber + 1
   end 
   lex <.> 
      {System.showInfo 'line '#@LineNumber#': unrecognized character'}
      GumpScanner.'class', putToken1('error')
   end 
 
   lex <<EOF>> 
      GumpScanner.'class', putToken1('EOF')
   end 
end 

Program 2.1: The LambdaScanner scanner specification.


2.1.2 Invoking Gump

Now that we have finished writing our specification, we want to translate it into an Oz class definition that implements our scanner. For this, we issue the compiler directive

\switch +gump

whereupon the compiler will accept Gump specifications.

Running Gump

Save the above specification in a file LambdaScanner.ozg. The extension .ozg indicates that the file contains Oz code with additional Gump definitions; it also ensures that Emacs fontifies the Gump definitions correctly. Feeding

\insert LambdaScanner.ozg

will process this file. Switch to the Compiler buffer (via C-c C-c) to watch Gump's status messages and any errors occurring during the translation.

Output Files

When the translation is finished, you will notice several new files in the current working directory, named after your scanner specification. Suppose your scanner is called S; then you will find the files S.l, S.C, S.o and S.so. The first three are intermediate results (respectively the input file for flex, the flex-generated C++ file, and the object code produced by the C++ compiler); the last is the resulting dynamic library used by the generated scanner.

Implementation Limitation

Note that, due to limitations of dynamic linking, a scanner may only be loaded into the system once. When developing a scanner interactively, this means that changes you make to the set or order of the regular expressions will not be picked up consistently. You should therefore halt and restart Mozart each time you change the regular expressions.

See Section 2.2.2 for a way to work around this limitation.

2.1.3 Using the Generated Scanner

Program 2.2 shows a sample program running our generated scanner.

The generated LambdaScanner class is instantiated as MyScanner. We have to call the method init() first to initialize the internal structures of GumpScanner.'class'.

Requesting Tokens

The procedure GetTokens repeatedly invokes the GumpScanner.'class' method

getToken(?X ?Y)

which returns the next token's class in X and its value in Y, removing the token from the token stream. GetTokens exits when the end of the token stream is reached, which is recognized by the token class 'EOF'.

Providing Inputs

To actually start scanning we have to provide an input character stream. This is done via one of the methods

scanFile(+FileName)

or

scanVirtualString(+V)

Each of these pushes the buffer currently in use (if any) onto an internal stack of buffers and builds a new buffer from the given source. Each time the end of a buffer is reached, the <<EOF>> rule matches; its action may pop a buffer and continue scanning the next-outer buffer where it left off, using the closeBuffer method described in Section 2.2.3.
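For instance, Program 2.2 could just as well take its input from a virtual string instead of the file Lambda.in (a sketch reusing MyScanner and GetTokens from Program 2.2):

% sketch: scan a virtual string instead of a file
{MyScanner scanVirtualString("define c = 17;\n")}
{GetTokens}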

Closing Scanners

When a scanner is not used anymore, it should be sent the message

close()

so that it can close any open files and release any allocated buffers. (This is necessary even when scanning virtual strings, due to the underlying implementation in C++.)

The following is a sample input for the scanner. Program 2.2 expects this to be placed in the file Lambda.in in the current directory:

% some input to test the class LambdaScanner
define f = lambda y.lambda z.(add y z);
define c = 17;
f c 7;
((fc7 


\switch +gump 
\insert gump/examples/LambdaScanner.ozg 
 
local 
   MyScanner = {New LambdaScanner init()}
   proc {GetTokens} T V in 
      {MyScanner getToken(?T ?V)}
      case T of 'EOF' then 
         {System.showInfo 'End of file reached.'}
      else 
         {System.show T#V}
         {GetTokens}
      end 
   end 
in 
   {MyScanner scanFile('Lambda.in')}
   {GetTokens}
   {MyScanner close()}
end 

Program 2.2: A program making use of the generated scanner.


