This is an implementation in pure Oz of the text categorization method described in
Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
Gertjan van Noord's implementation in Perl, available from http://odur.let.rug.nl/~vannoord/TextCat/, provided lots of inspiration too. Like van Noord's distribution, this Oz implementation concentrates on the task of recognizing languages. The method as such is, however, more general than that. Indeed, Cavnar and Trenkle use it to categorize documents based on their contents as well, and there is no reason why this wouldn't work with this implementation too.
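To give a feel for the method, here is a minimal sketch of its two core steps: building a ranked n-gram profile of a text, and computing the "out-of-place" distance between two profiles. It is written in Python purely for illustration and is not the code of this package. The function names, the `_` padding character, the whitespace tokenization, and the defaults `max_n=5` and `top_k=300` are assumptions made for the sketch.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = "_" + token + "_"  # assumed padding to mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically
    # so that the profile is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, model_profile):
    """Sum, over the document's n-grams, of how far each one is from its
    rank in the model; n-grams absent from the model get a fixed penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else penalty
        for rank, gram in enumerate(doc_profile)
    )
```

A text is then assigned to the category whose model profile has the smallest distance to the profile of the text.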
There are two modules of particular interest to a potential user. The TextCategorizer module exports a class with public methods useful for text categorization as such, when a set of categories and their corresponding models are assumed to already exist (in the form of a pickled record). The TextCategorizerManager inherits from TextCategorizer and makes publicly available a number of methods useful for creating new models of known texts.
Download the package, and invoke ozmake in a shell as follows:
ozmake --install --package=lager-text-categorizer.pkg
By default, all files of the package are installed in the user's
directory tree. In particular, all modules are installed in the user's private
Ranking gets bound to a list of pairs of the form
Distance is an integer representing the distance between the model
ModelName and the model of
String. The list is sorted in order of increasing distance.
ModelName gets bound to the name of the model closest to the model of
ModelNames to the list of names of stored models.
File, or (by default) starts from a new model.
String to the model store under the name
File to the model store under the name
Dir, a model of the contents of
F to the current model store. To be considered, the name of
F must have the form
<name> must not contain any period) and the resulting model is stored under the name
+ methods inherited from
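The ranking behaviour described above (a list of pairs sorted by increasing distance, with the closest model first) can be sketched as follows. This is illustrative Python, not the Oz API of this package: the names `rank_models` and `closest_model`, the `(name, distance)` tuple layout, and the representation of a model as a list of n-grams ordered by frequency are all assumptions.

```python
def out_of_place(doc_profile, model_profile):
    """Out-of-place distance between two ranked n-gram lists."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    penalty = len(model_profile)  # cost of an n-gram the model lacks
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else penalty
        for rank, gram in enumerate(doc_profile)
    )


def rank_models(doc_profile, models):
    """Return (model name, distance) pairs, smallest distance first.

    `models` maps a model name to its ranked n-gram list.
    """
    pairs = [
        (name, out_of_place(doc_profile, profile))
        for name, profile in models.items()
    ]
    # Sort by distance; equal distances are ordered by name.
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


def closest_model(doc_profile, models):
    """Return the name of the model with the smallest distance."""
    return rank_models(doc_profile, models)[0][0]


# Toy profiles, invented for this example only.
models = {"m1": ["a", "b", "c"], "m2": ["c", "b", "a"], "m3": ["x", "y", "z"]}
ranking = rank_models(["a", "b", "c"], models)
```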
The distribution includes two example applications: categorize and
train. These applications use the TextCategorizer module and the
TextCategorizerManager module, respectively. For example,
categorize may be invoked as follows:
categorize -l "This is an example of English"
and will then load the default model store and simply print, on standard output:
Closest match: english
To figure out what models (of, in this case, languages) are supported by the current model store, you say:
which will print, on standard output, the list
Available models: danish dutch english estonian finnish french german hungarian icelandic italian norwegian polish portuguese spanish swedish turkish
You may use train to create new models. For example, the invocation
train --directory=shortTexts --out=mymodels.ozp
will create new models for the text files in the directory shortTexts
and add them to mymodels.ozp. (By the way, I have borrowed these
language samples from van Noord's distribution.) The program will consider each
file of the form <name>.txt (no periods are allowed in <name>),
and the corresponding model will be named <name>. If there
is already a model in mymodels.ozp with that name, it will be replaced.
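The file selection and replacement rules just described can be sketched like this. It is illustrative Python, not the code of the train application: the function names, the use of a plain dictionary as the model store, and the `build_model` callback are assumptions made for the example.

```python
from pathlib import Path


def collect_training_files(directory):
    """Return {name: text} for every file called <name>.txt in `directory`,
    skipping files whose <name> part contains a period."""
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix == ".txt" and "." not in path.stem:
            texts[path.stem] = path.read_text(encoding="utf-8")
    return texts


def update_store(store, directory, build_model):
    """Add one model per eligible file to `store`.

    A model already stored under the same name is replaced;
    models stored under other names are left untouched.
    """
    for name, text in collect_training_files(directory).items():
        store[name] = build_model(text)
    return store
```

With these rules, a file such as english.txt yields a model named english, while a file such as notes.v2.txt is ignored because its name part contains a period.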
--out may point to the same file. If
is not specified, a new model will be created. If
--out is not
specified, the store will be saved in