Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same
time offering indexed search functionalities. As this new technology
permeated applications like bioinformatics, string collections
experienced a growth that outpaces Moore's Law and challenges our ability
to handle them even in compressed form. Fortunately, it turns out that
many of these rapidly growing string collections are highly repetitive,
so that their information content is orders of magnitude lower than their
plain size. The statistical compression methods used for classical collections,
however, are blind to this repetitiveness, and therefore a new set of techniques
has been developed to properly exploit it. The resulting indexes form a new
generation of data structures able to handle the huge repetitive string
collections that we are facing.
In this two-part survey, we cover the algorithmic developments that
have led to these data structures.
In this first part, we describe the distinct compression paradigms that have
been used to exploit repetitiveness, and the algorithmic techniques that
provide direct access to the compressed strings. In the quest for an ideal
measure of repetitiveness, we uncover a fascinating web of relations between
those measures, as well as the limits up to which the data can be recovered,
and up to which direct access to the compressed data can be provided. These
are the basic aspects of indexability, which is covered in the second part
of this survey.