A Practical Index for Text Retrieval Allowing Errors
Ricardo Baeza-Yates and Gonzalo Navarro
We propose a text indexing technique for approximate pattern matching, which
is practical and especially aimed at Information Retrieval (IR).
Unlike other indices of this kind, it is able to retrieve any string that
approximately matches a given search pattern.
Every sequence of a fixed length appearing in the text is stored in the index,
together with pointers to all the positions where it appears.
The search pattern is cut into pieces so that at least one must match exactly.
All the pieces are searched in the index and the union of candidate positions
is verified.
To reduce space requirements, pointers to blocks instead of exact positions
can be used, which increases querying costs.
We design an algorithm to optimize the pattern partition into pieces so that
the total number of verifications is minimized.
This also allows to know in advance the expected cost of the search and the
expected relevance of the query to the user.
We show experimentally the build time, space requirements and query times of
our index, finding that it is a practical alternative for IR.