Combining Text Compression and String Matching: The Miracle of Self-Indexing
Gonzalo Navarro
This decade has witnessed the raise of what I consider the most important
breakthrough of modern times in text compression and indexed string matching.
Self-indexing is the mechanism by which a text is simultaneously
compressed and indexed, so that the self-index occupies space close to that
of the compressed text, provides random access to any part of it, and in
addition supports efficient indexed pattern matching. Thus a self-index can
replace the text by a compressed version with enhanced search functionalities.
Self-indexing builds on a large base of compressed data structures,
which is another fascinating algorithmic area that has appeared two decades
ago with the aim of obtaining compact representations of classical data
structures. Although they usually require more instructions than their
classical counterparts to operate, they can benefit from the memory hierarchy.
This is particularly noticeable when they can operate in main memory in cases
where the classical structures require disk storage.
My aim in this talk is to present a thin ``vertical'' slice of this
construction,
so that there is time to visualize in sufficent detail a complete solution
from
the basics to the final result. I will start with a plain and a compressed
solution to provide rank on bitmaps, a simple operation of counting the
number of 1s up to a given position, with a surprising number of applications.
I will then introduce wavelet trees, which constitute a sort of
self-index
for sequences, supporting operation rank for the alphabet symbols. Then
I
will explain the Burrows-Wheeler Transform and the FM-index concept, which
coupled with wavelet trees offer a fully-functional self-index. Finally, I
will show how this simple combination is able of achieving high-order
compression of a text, and will give some insights on recent work around
indexing highly repetitive sequence collections, such as DNA and protein
databases, versioned data, and temporal text databases. I will conclude by
posing some open challenges.