Levenshtein

Torbjörn Lager

provides
x-ozlib://lager/levenshtein/Levenshtein.so{native}
x-ozlib://lager/levenshtein/Levenshtein.ozf
x-ozlib://lager/levenshtein/match.exe

Purpose

The modules in this package export functions which measure the so called edit distance (also called Levenshtein distance) between two strings, a source and a target. The edit distance is defined as the number of deletions, insertions, or substitutions required to transform the source into the target. The greater the distance, the more different the strings are, and vice versa. Edit distance can be (and has been) used for spell checking and speech recognition purposes.

The distribution contains two functionally equivalent implementations, one in C linked into Oz, and one in pure Oz. They are both straightforward implementations of Levenshtein's algorithm - a dynamic programming algorithm capable of calculating the edit distance in time proportional to the length of the source times the length of the target. The C-based version is roughly eight times faster than the pure Oz version, and is therefore recommended for serious use.

Installation

Download the package, and invoke ozmake in a shell as follows:

ozmake --install --package=lager-levenshtein.pkg

By default, all files of the package are installed in the user's ~/.oz directory tree. In particular, all modules are installed in the user's private cache.

Usage

The C version:

import Levenshtein at 'x-ozlib://lager/levenshtein/Levenshtein.so{native}'      
 ... 
{Levenshtein.distance +S1 +S2 ?I} 

The pure Oz version:

import Levenshtein at 'x-ozlib://lager/levenshtein/Levenshtein.ozf'      
 ... 
{Levenshtein.distance +S1 +S2 ?I}  

Example

For example,

{Levenshtein.distance "Mozart" "Mogul" I}

binds I to 4, since three substitutions and one deletion is required to transform "Mozart" into "Mogul".

Example Application

The distribution also includes match, a stand-alone demo application which reads and tokenizes a text file and prints each token - if it is within a given edit distance to a given target word - to a file, or to stdout by default. It can be invoked in the following way:

bash$ ./match -i test.txt -t lovely -l 1 -h 2
closely
levels  
     
Execution time: 40 ms
bash$

Invoke match --help to see what options are available. Among other things, there is an option that will let you compare the speed of the two implementations.

Related Information

A thread discussing fast edit distance algorithms appeared on a the corpora mailing list a couple of years ago, and is summarized here:

http://www.hit.uib.no/corpora/2000-1/0021.html

Note that the C code used in the native functor described above appears at the very end of this document, and that it was contributed by Gertjan van Noord.


Torbjörn Lager