mogul:/lager/simple-tokenizer

type	:	package
id	:	`mogul:/lager/simple-tokenizer`
section	:	mogul:/lager
blurb	:	A simple tokenizer for natural language
author	:	Torbjörn Lager
category	:	`nlp`
documentation	:	index.html
download	:	lager-simple-tokenizer__1.2.5__source__0.pkg lager-simple-tokenizer__1.3.0__source__0.pkg
provides	:	`[nlp] x-ozlib://lager/simple-tokenizer/Tokenizer.ozf` `[nlp] x-ozlib://lager/simple-tokenizer/EnglishTokenizer.ozf` `[nlp] x-ozlib://lager/simple-tokenizer/tokenize.exe`

Module Tokenizer.ozf exports a class that implements a simple tokenizer for natural language. Given a string, it returns a list of strings, where each string is considered a token (subclassing can change this to, e.g., a list of atoms). The tokenizer has reasonable default behaviour for most European languages (well, for English and Swedish at least...), and it can be tailored to specific languages and applications by subclassing. For example, the tokenizer for English separates contractions into multiple tokens: it splits the word "don't" into the two tokens "do" and "n't", where "n't" is treated as a special form of "not", and it treats "John's" as the two tokens "John" and "'s". It is done this way because this is the tokenization the Brill tagger expects.
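
The behaviour described above can be illustrated with a rough sketch. This is not the package's Oz code or its API: it is written in Python for brevity, and the names `SimpleTokenizer`, `EnglishTokenizer`, `split_word` and `tokenize` are hypothetical. It only assumes what the description states, namely a language-neutral base class, and an English subclass that splits off "n't" and "'s".

```python
import re


class SimpleTokenizer:
    """Hypothetical base class: language-neutral default behaviour."""

    # A run of letters (optionally with one internal apostrophe part),
    # a run of digits, or any other single non-space character.
    _pattern = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+|\S")

    def split_word(self, word):
        # Hook for subclasses; by default a matched word is one token.
        return [word]

    def tokenize(self, text):
        tokens = []
        for match in self._pattern.finditer(text):
            tokens.extend(self.split_word(match.group()))
        return tokens


class EnglishTokenizer(SimpleTokenizer):
    """Hypothetical subclass: splits the contractions named in the text."""

    def split_word(self, word):
        lower = word.lower()
        if lower.endswith("n't") and len(word) > 3:
            return [word[:-3], word[-3:]]   # "don't"  -> "do", "n't"
        if lower.endswith("'s") and len(word) > 2:
            return [word[:-2], word[-2:]]   # "John's" -> "John", "'s"
        return [word]


print(EnglishTokenizer().tokenize("John's dog don't bite."))
```

In this sketch the language-specific rule lives in a single overridable method, which mirrors the tailoring-by-subclassing approach described above; the real package may be organised differently.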

It is important to emphasize that this is a simple program. It was written to be used by the Brill tagger, but since it is also independently useful, it is made available separately.

Simple Tokenizer