A Metric Index for Approximate String Matching
Gonzalo Navarro and Edgar Chávez
We present a radically new indexing approach for approximate string matching.
The scheme uses the metric properties of the edit distance and can be applied
to any other metric between strings. We build a metric space where the sites
are the nodes of the suffix tree of the text, and the approximate query is seen
as a proximity query on that metric space. This permits us finding the occ
occurrences of a pattern of length m, permitting up to r differences, in a
text of length n over an alphabet of size s, in average time
O(m^(1+e) + occ) for any e>0, if r = o(m/log_s(m))
and m > ((1+e)/e)log_s(n). The index works well up to
r < (3-sqrt(2))m/log_s(m), where it achieves its maximum average
search complexity O(m^(1+sqrt(2)+e)+occ). The construction time of
the index is O(m^(1+sqrt(2)+e) n log n) and its space is
O(m^(1+sqrt(2)+e) n). This is the first index achieving average
search time polynomial in m and independent of n, for r = O(m/log_s(m)). Previous methods achieve this complexity only for r=O(m/log_s(n)).
We also present a simpler scheme needing O(n) space.