Fast Multi-Dimensional Approximate Pattern Matching
Gonzalo Navarro and Ricardo Baeza-Yates
We address the problem of approximate string matching in d dimensions,
that is, to find a pattern of size m^d in a text of size n^d
with at most k < m^d errors (substitutions, insertions and deletions
along any dimension).
We use a novel and very flexible error model, for which there exists
only an algorithm to evaluate the similarity between two elements in two
dimensions at O(m^4) time.
We extend the algorithm to d dimensions, at O(d!m^(2d)) time and
O(d!m^(2d-1)) space.
We also give the first search algorithm for such model, which is
O(d!m^dn^d) time and O(d!m^dn^(d-1)) space. We show how to reduce
the space cost to O(d!3^dm^(2d-1)) with little time penalty.
Finally, we present the first sublinear-time (on average) searching algorithm
(i.e. not all text cells are inspected), which is O(kn^d/m^(d-1)) for
k < (m/(d(\log(base s) m - \log(base s) d)))^(d-1) where s is
the alphabet size.
After that error level the filter still
remains better than dynamic programming for
k <= m^(d-1)/(d(\log(base s) m - \log(base s) d))^((d-1)/d)
These are the first search algorithms for the problem. As side-effects
we extend to d dimensions an already proposed algorithm for two-dimensional
exact string matching, and we obtain a sublinear-time filter to search in d
dimensions allowing k mismatches.