Text Categorizer

blurb: An N-gram-based text categorizer/language recognizer
author: Torbjörn Lager
provides:
[nlp] x-ozlib://lager/text-categorizer/TextCategorizer.ozf
[nlp] x-ozlib://lager/text-categorizer/TextCategorizerManager.ozf
[nlp] x-ozlib://lager/text-categorizer/categorize.exe
[nlp] x-ozlib://lager/text-categorizer/train.exe

This is an implementation in pure Oz of the text categorization method described in

Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

Gertjan van Noord's implementation in Perl provided a lot of inspiration too. Like van Noord's distribution, this Oz implementation concentrates on the task of recognizing languages. The method itself is more general than that, however. Cavnar and Trenkle also use it to categorize documents by their content, and there is no reason why that should not work with this implementation as well.
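As a rough illustration of the Cavnar and Trenkle method (not of this package's Oz API), the following Python sketch builds ranked character n-gram profiles and compares them with the out-of-place measure described in the paper. The function names, the underscore padding, the 300-entry cutoff, and the two tiny training samples are assumptions chosen for illustration; the package's own classes may differ in all of these details.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Keep only runs of letters; digits and punctuation are discarded.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + token + "_"  # underscores mark word boundaries (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(rank - cat_rank[gram]) if gram in cat_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def categorize(text, models):
    """Return the name of the category whose profile is closest to the text."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda name: out_of_place(doc_profile, models[name]))


# Toy training data (hypothetical); real models need far more text per category.
ENGLISH_SAMPLE = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
SWEDISH_SAMPLE = "den snabba bruna räven hoppar över den lata hunden och sedan sover hunden"

models = {
    "english": ngram_profile(ENGLISH_SAMPLE),
    "swedish": ngram_profile(SWEDISH_SAMPLE),
}

print(categorize("the dog and the fox", models))
```

The same functions work for topic categorization if the models are built from documents grouped by subject rather than by language, which is the generalization the paragraph above refers to.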

Two modules are of particular interest to a potential user. The TextCategorizer module exports a class whose public methods perform the text categorization itself, assuming that a set of categories and their corresponding models already exists (in the form of a pickled record). The TextCategorizerManager inherits from TextCategorizer and adds a number of public methods for creating new models from known texts.