Fast Multi-Dimensional Approximate Pattern Matching

Gonzalo Navarro and Ricardo Baeza-Yates

We address the problem of approximate string matching in d dimensions, that is, to find a pattern of size m^d in a text of size n^d with at most k < m^d errors (substitutions, insertions and deletions along any dimension). We use a novel and very flexible error model, for which there exists only an algorithm to evaluate the similarity between two elements in two dimensions at O(m^4) time. We extend the algorithm to d dimensions, at O(d!m^(2d)) time and O(d!m^(2d-1)) space. We also give the first search algorithm for such model, which is O(d!m^dn^d) time and O(d!m^dn^(d-1)) space. We show how to reduce the space cost to O(d!3^dm^(2d-1)) with little time penalty. Finally, we present the first sublinear-time (on average) searching algorithm (i.e. not all text cells are inspected), which is O(kn^d/m^(d-1)) for k < (m/(d(\log(base s) m - \log(base s) d)))^(d-1) where s is the alphabet size. After that error level the filter still remains better than dynamic programming for k <= m^(d-1)/(d(\log(base s) m - \log(base s) d))^((d-1)/d) These are the first search algorithms for the problem. As side-effects we extend to d dimensions an already proposed algorithm for two-dimensional exact string matching, and we obtain a sublinear-time filter to search in d dimensions allowing k mismatches.