Indexing Text with Approximate q-grams
Gonzalo Navarro, Erkki Sutinen and Jorma Tarhio
We present a new index for approximate string matching. The index collects
text q-samples, that is, disjoint text substrings of length q, at fixed
intervals and stores their positions. At search time, part of the text is
filtered out by noticing that any occurrence of the pattern must be
reflected in the presence of some text q-samples that match approximately
inside the pattern. Hence the index points out the text areas that could
contain occurrences and must be verified. The index parameters permit load
balancing between filtering and verification work, and provide a compromise
between the space requirement of the index and the error level for which the
filtration is still efficient. We show experimentally that the index is
competitive against others that take more space, being in fact the fastest
choice for intermediate error levels, an area where no current index is useful.