New Models and Algorithms for Multidimensional Approximate Pattern Matching

Ricardo Baeza-Yates and Gonzalo Navarro

We focus on how to compute the edit distance (or similarity) between two images and the problem of approximate string matching in two dimensions, that is, to find a pattern of size m x m in a text of size n x n with at most k errors (character substitutions, insertions and deletions). Pattern and text are matrices over an alphabet of size sigma. We present new models and give the first sublinear time search algorithms for the new and the existing models.

The only existing model just considers errors along one dimension. The associated approximate search algorithms use dynamic programming and are relatively expensive (O(m^2*n^2) or O(k^2*n^2)). For this model we present a filtering algorithm which avoids verifying most of the text with dynamic programming. This filter is based on one-dimensional multipattern approximate searching. The average complexity of our resulting algorithm is O(n^2*k*log_sigma(m)/m^2) for k < m*(m+1)/(5*log_sigma(m)), which is optimal and matches the best previous result that allows only substitutions. We present other slower filtration algorithms that however work for higher error levels.

We then consider more general models to compare images. We present new similarity measures and the algorithms to compute them. We then focus on one of the models, which allows the errors to occur along any dimension, and extend it to the general case where pattern and text are d-dimensional. This edit distance can be computed in O(d!m^(2d)) time and O(d!m^(2d-1)) space. We also present the first sublinear-time (on average) searching algorithm (i.e. not all text cells are inspected), which is O(k*n^d/m^(d-1)) time for k < (m/(d(log_sigma(m/d))))^(d-1).