Simple Tokenizer

Torbjörn Lager



Module Tokenizer.ozf exports a class that implements a simple tokenizer for natural language. Given a string, it returns a list of strings (though this can be changed, e.g. to a list of atoms, by subclassing), where each string is considered a token. The tokenizer has reasonable default behaviour for most European languages (well, for English and Swedish at least...), and it can be tailored to specific languages and applications by subclassing. For example, the tokenizer for English separates contractions into multiple tokens: it splits the word "don't" into the two tokens "do" and "n't", where "n't" is treated as a special form of "not", and the word "John's" into the two tokens "John" and "'s". Tokens are split in this way because this is how the Brill tagger wants them.
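The contraction splitting described above can be sketched as follows. This is an illustrative Python fragment, not the package's actual Oz code, and the set of suffixes is an assumption:

```python
import re

# Illustrative sketch only: split English contractions the way the Brill
# tagger expects. The suffix list is an assumption, not the package's rules.
CONTRACTION = re.compile(r"^(\w+)(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def split_contraction(word):
    """Split e.g. "don't" into ["do", "n't"] and "John's" into ["John", "'s"]."""
    m = CONTRACTION.match(word)
    if m:
        return [m.group(1), m.group(2)]
    return [word]
```

Words without a recognized suffix pass through unchanged, so ordinary tokens like "cats" are not split.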

It is important to emphasize that this is a simple program. It was written to be used by the Brill tagger, but since it is also independently useful, I decided to make it available separately.


Download the package, and invoke ozmake in a shell as follows:

ozmake --install --package=lager-simple-tokenizer.pkg

By default, all files of the package are installed in the user's ~/.oz directory tree. In particular, all modules are installed in the user's private cache.



Module Tokenizer.ozf exports, on feature class, a class definition for a tokenizer for natural language. It is up to each application to specialize the methods for individual natural languages.


init
An initialization method (which doesn't really do anything).
tokenize(String ?Tokens)
Tokens gets bound to the tokens in String.

These are the overridable methods that control how the tokenizer works:

isWordChar(C ?B)
B is bound to true if C is to be handled as part of a word.
isPunctuationChar(C ?B)
B is bound to true if C is to be handled as a punctuation char.
toToken(Cs ?Token)
Determines how an individual token will be built from a list of characters.
postProcess(TokensIn ?TokensOut)
A hook for a post processor.
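The overall design can be sketched in Python as follows. The method names mirror the Oz interface listed above, but the bodies are only illustrative defaults, not the package's actual implementation:

```python
# Sketch of the tokenizer's hook-based design; the driver loop and the
# default hook implementations are assumptions for illustration.
class Tokenizer:
    def tokenize(self, string):
        tokens, current = [], []
        for c in string:
            if self.isWordChar(c):
                current.append(c)
            else:
                if current:
                    tokens.append(self.toToken(current))
                    current = []
                if self.isPunctuationChar(c):
                    tokens.append(self.toToken([c]))
        if current:
            tokens.append(self.toToken(current))
        return self.postProcess(tokens)

    # Overridable hooks:
    def isWordChar(self, c):
        return c.isalnum() or c == "'"

    def isPunctuationChar(self, c):
        return c in ".,;:!?()\"-"

    def toToken(self, chars):
        # Build a token from a list of characters (e.g. a string or an atom).
        return "".join(chars)

    def postProcess(self, tokens):
        # Hook for e.g. splitting contractions; does nothing by default.
        return tokens
```

A subclass for a specific language would typically override postProcess (to split contractions, say) or toToken (to build atoms instead of strings).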


The package also contains a tokenizer for English that is implemented by subclassing Tokenizer.

Example Application

The distribution also includes a stand-alone application that prints each token on a separate line. It can be invoked on a text file as follows:

tokenize --in=test.txt
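The behaviour of the stand-alone application can be approximated with the following self-contained Python sketch (the regex is a simplification of the package's actual tokenization rules):

```python
import re
import sys

# Illustrative stand-in for the stand-alone application: read a text file
# and print each token on its own line. The simple word/punctuation regex
# is an assumption, not the package's rule set.
def print_tokens(path, out=sys.stdout):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for token in re.findall(r"\w+|[^\w\s]", text):
        print(token, file=out)
```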
