Improving an Algorithm for Approximate String Matching
Gonzalo Navarro and Ricardo Baeza-Yates
We study a recent algorithm for fast on-line approximate string
matching. This is the problem of searching a pattern in a text allowing errors
in the pattern or in the text.
The algorithm is based on a very fast kernel which is able to search
short patterns using a non-deterministic finite automaton, which
is simulated using bit-parallelism. A number of techniques to extend this
kernel for longer patterns are presented in that work.
However, the techniques can be integrated in many ways and the optimal
interplay among them is by no means obvious.
The solution to this problem starts at a very low level, by obtaining basic
probabilistic information about the problem which was not previously known,
and ends integrating analytical results with empirical data to obtain
the optimal heuristic.
The conclusions obtained via analysis are experimentally confirmed.
We also improve many of the techniques and obtain a combined
heuristic which is faster than the original work.
This work shows an excellent example of a complex and theoretical analysis
of algorithms used for design and for practical algorithm engineering,
instead of the common practice of first designing an algorithm and
then analyzing it.