Gump Tokenizer

Torbjörn Lager



This is a natural language tokenizer based on Gump (which is itself based on Flex). It improves upon the simple tokenizer and represents a much better approach to the tokenization of natural language. Among other things, it does not need to be paired with a sentence splitter, since it handles sentence splitting all by itself. It is, however, somewhat slower and more heavyweight than the simple tokenizer.

Although the tokenizer in this package is set up for English, it should be fairly straightforward to port to other (similar) languages. However, note that a tokenizer for natural language often needs to be tuned, not only to a particular language, but also to the kind of texts on which it is going to be used.

Token classes

In the present version of the tokenizer, only four token classes are distinguished:

p Paragraph delimiter (ends a paragraph)
s Sentence delimiter (ends a sentence)
w 'Word' (includes ordinary words, but also abbreviations, numbers, etc.)
c Other (separators, etc.)
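
As an illustration, a two-sentence text such as "Hello there. How are you?" might be broken into tokens along the following lines (hypothetical output; the exact tokens and values depend on the grammar in 'EnglishTokenizer.oz'):

w Hello
w there
s .
w How
w are
w you
s ?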

As can be seen from the actual Gump definitions (in the source file 'EnglishTokenizer.oz'), it would be possible to have a more fine-grained set of classes (recognizing e.g. abbreviations, dates, etc.), but the risk of misclassification would then greatly increase.

The tokenizer separates contractions into multiple tokens: e.g., it splits the word "don't" into the two tokens "do" and "n't", where "n't" is treated as a special form of "not". Similarly, the word "John's" is treated as the two tokens "John" and "'s". This is done because it is the format the Brill tagger expects.
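
Under this scheme, a sentence containing contractions, say "Don't touch John's car.", would come out roughly as follows (hypothetical output, for illustration only):

w Do
w n't
w touch
w John
w 's
w car
s .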


Installation

Download the package, and invoke ozmake in a shell as follows:

ozmake --install --package=lager-gump-tokenizer.pkg

By default, all files of the package are installed in the user's ~/.oz directory tree. In particular, all modules are installed in the user's private cache.



Interface

Tokenizer.'class' defines functionality, inherited from Gump, that can be used by users of the generated tokenizer. Only part of what is available is listed below; refer to the Gump manual for more information.

meth init()
This initializes the internal structures of the tokenizer. This must be called before any other method of this class.
meth getToken(?X ?Y)
The next token is removed from the token stream and returned. The token class is returned in X and its value in Y. Both X and Y are atoms.
meth scanFile(+F)
A new buffer is created from the file with name F and tokenized. If the file does not exist, the error exception gump(fileNotFound F) with the filename in F is raised.
meth scanVirtualString(+V)
Like scanFile, but scans a virtual string V.
meth close()
Closes all buffers. Before calling any other methods, you should call init() again.


Example

This is how we (in the OPI) write a function GetSentence that retrieves one sentence (a list of words) from the tokenizer each time it is called:


%% Link functor, get module
[Tokenizer] = {Module.link ['x-ozlib://lager/gump-tokenizer/EnglishTokenizer.ozf']}

%% Create and initialize Tokenizer object
MyTokenizer = {New Tokenizer.'class' init()}

%% Tokenize file
{MyTokenizer scanFile('test.txt')}

fun {GetSentence} T V in
   {MyTokenizer getToken(?T ?V)}
   case T
   of 'EOF' then nil
   [] p then nil
   [] s then [V]
   [] w then V|{GetSentence}
   [] c then V|{GetSentence}
   end
end

%% Each time you feed this, the Inspector will
%% show a different sentence from 'test.txt',
%% and 'nil' when there are no sentences left
{Inspect {GetSentence}}
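
Building on GetSentence, a small helper can collect all remaining sentences into one list. This is only a sketch, not part of the package; note that GetSentence as written returns nil both at a paragraph break and at the end of the file, so the helper stops at whichever comes first:

fun {GetSentences}
   S = {GetSentence}
in
   if S == nil then nil
   else S|{GetSentences}
   end
end

%% Feed this to inspect a list of the remaining sentences
{Inspect {GetSentences}}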

/* Feed this to close the tokenizer when you're done.
{MyTokenizer close()}
*/

Example Application

The distribution also includes a stand-alone application that prints each token and its class on a separate line. It can be invoked on a text file as follows:

tokenize --in=test.txt
