Approximate String Matching with Compressed Indexes.
Luís Russo, Gonzalo Navarro, Arlindo Oliveira, and Pedro Morales.
A compressed full-text self-index for a text T is a data
structure requiring reduced space and able to search for
patterns P in T. It can also reproduce any
substring of T, thus actually replacing T. Despite the recent
explosion of interest on compressed indexes, there has not
been much progress on functionalities beyond the basic exact
search. In this paper we focus on indexed approximate string
matching (ASM), which is of great interest, say, in bioinformatics.
We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed
suffix trees/arrays. Most compressed self-indexes belong to one
of these classes. We start by adapting the classical method of
partitioning into exact search to self-indexes, and
optimize it over a representative of either class of
self-index. Then, we show that a Lempel-Ziv index can be seen as an
extension of the classical q-samples index. We give new insights
on this type of index, which can be of independent interest, and then
apply them to a Lempel-Ziv index. Finally, we improve hierarchical
verification, a successful technique for sequential searching, so as to
extend the matches of pattern pieces to the left or right. Most compressed
suffix trees/arrays support the required bidirectionality, thus enabling the
implementation of the improved technique. In turn, the improved verification
largely reduces the accesses to the text, which are expensive in
self-indexes. We show experimentally that our algorithms are competitive
and provide useful space-time tradeoffs compared to classical indexes.