Text Categorizer

Torbjörn Lager

provides
x-ozlib://lager/text-categorizer/TextCategorizer.ozf
x-ozlib://lager/text-categorizer/TextCategorizerManager.ozf
x-ozlib://lager/text-categorizer/categorize.exe
x-ozlib://lager/text-categorizer/train.exe
 

Purpose

This is an implementation in pure Oz of the text categorization method described in:

Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

Gertjan van Noord's Perl implementation, available from http://odur.let.rug.nl/~vannoord/TextCat/, also provided a lot of inspiration. Like van Noord's distribution, this Oz implementation concentrates on the task of recognizing languages. The method itself is more general, however: Cavnar and Trenkle also use it to categorize documents by their contents, and there is no reason why that would not work with this implementation too.
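As a rough illustration of the method in the paper (not a transcription of the Oz code in this package), the sketch below builds a ranked character n-gram profile for a text and compares two profiles with the "out-of-place" distance. It is written in Python; the function names, the underscore padding, the letters-only tokenization and the cutoff of 300 n-grams are assumptions made for the example, and the padding is a simplification of the scheme in the paper.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Keep runs of letters only; digits and punctuation act as separators.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        # Mark word boundaries with one underscore on each side.
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, model_profile):
    """Sum, over the document's n-grams, how far each one is from its
    position in the model profile; n-grams missing from the model are
    charged a fixed maximum penalty."""
    position = {gram: i for i, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(i - position[gram]) if gram in position else penalty
        for i, gram in enumerate(doc_profile)
    )


english = ngram_profile("the quick brown fox jumps over the lazy dog")
swedish = ngram_profile("den snabba bruna räven hoppar över den lata hunden")
sample = ngram_profile("the lazy dog")
print(out_of_place(sample, english), out_of_place(sample, swedish))
```

The model with the smallest distance is taken as the category of the text. Real models would of course be built from much longer samples than these one-line examples.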

There are two modules of particular interest to a potential user. The TextCategorizer module exports a class whose public methods perform the categorization itself, assuming that a set of categories and their corresponding models already exists (in the form of a pickled record). The TextCategorizerManager inherits from TextCategorizer and adds public methods for creating new models from known texts.

Installation

Download the package, and invoke ozmake in a shell as follows:

ozmake --install --package=lager-text-categorizer.pkg

By default, all files of the package are installed in the user's ~/.oz directory tree. In particular, all modules are installed in the user's private cache.

Usage

TextCategorizer.ozf

Methods

init(File)
An initialization method which loads a model store from File.
rank(String ?Ranking)
Ranking gets bound to a list of pairs of the form ModelName#Distance, where Distance is an integer representing the distance between the model ModelName and the model of String. The list is sorted in order of increasing distance.
categorize(String ?ModelName)
ModelName gets bound to the name of the model closest to the model of String.
models(?ModelNames)
Binds ModelNames to the list of names of stored models.
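How the three query methods relate to each other can be sketched as follows. This is a Python analogue, not the Oz class itself: the class name ToyCategorizer, the pluggable distance function, the alphabetical tie-breaking and the sorted output of models are all assumptions made for the example, and missing_chars is only a stand-in for the real n-gram distance.

```python
class ToyCategorizer:
    """Python analogue of the rank/categorize/models methods (hypothetical)."""

    def __init__(self, store, distance):
        self.store = dict(store)      # model name -> model
        self.distance = distance      # function (text, model) -> int

    def rank(self, text):
        # Pairs of (model name, distance), closest model first.
        pairs = [(name, self.distance(text, model))
                 for name, model in self.store.items()]
        return sorted(pairs, key=lambda pair: (pair[1], pair[0]))

    def categorize(self, text):
        # The category is simply the first entry of the ranking.
        return self.rank(text)[0][0]

    def models(self):
        return sorted(self.store)


def missing_chars(text, alphabet):
    # Stand-in distance: number of distinct characters of the text
    # that do not occur in the model's alphabet.
    return len(set(text.lower()) - set(alphabet))


categorizer = ToyCategorizer(
    {"english": "abcdefghijklmnopqrstuvwxyz ",
     "swedish": "abcdefghijklmnopqrstuvwxyzåäö "},
    missing_chars,
)
print(categorizer.rank("språk"))
print(categorizer.categorize("språk"))
```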

TextCategorizerManager.ozf

Methods

init(File<=new)
An initialization method which loads a model store from File, or (by default) starts from a new model store.
addModel(ModelName String)
Adds a model of String to the model store under the name ModelName.
addModelFromFile(ModelName File)
Adds a model of the contents of File to the model store under the name ModelName.
addModelsFromDir(Dir)
Adds, for each file F in directory Dir, a model of the contents of F to the current model store. To be considered, the name of F must have the form <name>.txt (<name> must not contain any period) and the resulting model is stored under the name <name>.
saveModels(File)
Saves the current model store to File.

+ methods inherited from TextCategorizer
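The file selection rule used by addModelsFromDir can be sketched in Python as follows. The function names are hypothetical, and several details are assumptions rather than documented behaviour: the extension is matched case-sensitively, a file named just ".txt" is skipped, and files are read as UTF-8.

```python
from pathlib import Path


def model_name_for(filename):
    """Return <name> for a file called <name>.txt, or None if the file
    should be ignored (wrong extension, or a period inside <name>)."""
    stem, dot, extension = filename.rpartition(".")
    if dot and extension == "txt" and stem and "." not in stem:
        return stem
    return None


def texts_by_model_name(directory):
    """Map each accepted model name to the contents of its file."""
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        name = model_name_for(path.name)
        if path.is_file() and name is not None:
            texts[name] = path.read_text(encoding="utf-8")
    return texts
```

Under these assumptions english.txt yields a model named english, while old.english.txt and notes.md are ignored.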

Example Applications

The distribution includes two example applications: categorize and train. These applications use the TextCategorizer module and the TextCategorizerManager module, respectively. For example, categorize may be invoked as follows:

categorize -l "This is an example of English"

and will then load the default model store and print, on standard output:

Closest match: english

To find out which models (in this case, of languages) the current model store supports, run:

categorize -c

which will print, on standard output, the list:

Available models: danish dutch english estonian finnish french german hungarian icelandic italian norwegian polish portuguese spanish swedish turkish

You may use train to create new models. For example, the invocation

train --directory=shortTexts --out=mymodels.ozp

will create new models for the text files in the directory shortTexts and add them to mymodels.ozp. (By the way, I have borrowed these language samples from van Noord's distribution.) The program considers each file whose name has the form <name>.txt (no periods are allowed in <name>) and names the corresponding model <name>. If there is already a model with that name in mymodels.ozp, it will be replaced. --in and --out may point to the same file. If --in is not specified, a new model store will be created. If --out is not specified, the store will be saved in default.ozp.
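The way --in and --out interact can be sketched in Python as follows. This is only an analogue of the behaviour described above, not the train program itself: the function name and parameters are hypothetical, and a JSON file stands in for the pickled Oz record that the real model store uses.

```python
import json
from pathlib import Path


def train(new_models, in_path=None, out_path="default.json"):
    """Merge new models into a store and save it (hypothetical analogue)."""
    # Start from the existing store if an input file is given,
    # otherwise from a new, empty one.
    if in_path is not None:
        store = json.loads(Path(in_path).read_text(encoding="utf-8"))
    else:
        store = {}
    # A new model replaces any stored model that has the same name.
    store.update(new_models)
    # in_path and out_path may be the same file: the input is read
    # completely before the output is written.
    Path(out_path).write_text(
        json.dumps(store, ensure_ascii=False), encoding="utf-8"
    )
    return store
```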

 

