A Metric Index for Approximate String Matching

Gonzalo Navarro and Edgar Chávez

We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the occ occurrences of a pattern of length m, permitting up to r differences, in a text of length n over an alphabet of size s, in average time O(m^(1+e) + occ) for any e>0, if r = o(m/log_s(m)) and m > ((1+e)/e)log_s(n). The index works well up to r < (3-sqrt(2))m/log_s(m), where it achieves its maximum average search complexity O(m^(1+sqrt(2)+e)+occ). The construction time of the index is O(m^(1+sqrt(2)+e) n log n) and its space is O(m^(1+sqrt(2)+e) n). This is the first index achieving average search time polynomial in m and independent of n, for r = O(m/log_s(m)). Previous methods achieve this complexity only for r=O(m/log_s(n)). We also present a simpler scheme needing O(n) space.