Storage and Retrieval of Highly Repetitive Sequence Collections
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko
Välimäki
A repetitive sequence collection is a set of sequences which are small
variations of each other.
A prominent example are genome sequences of individuals of
the same or close species, where
the differences can be expressed by short lists of basic edit operations.
Flexible and efficient data analysis on such a typically huge collection is
plausible using
suffix trees. However, the suffix tree occupies much space,
which very soon inhibits
in-memory analyses. Recent advances in full-text indexing reduce the space of
the suffix tree to, essentially, that of the compressed sequences, while retaining its
functionality with only a polylogarithmic slowdown.
However, the underlying compression model considers only the predictability of
the next sequence symbol given the k previous ones, where k is a small
integer. This is unable to capture longer-term repetitiveness.
For example, r identical copies of an incompressible sequence will be
incompressible under this model.
We develop new static and dynamic full-text indexes
that are able of capturing the fact that a collection is highly repetitive,
and require space basically proportional to the
length of one typical sequence plus the total number of edit operations.
The new indexes can be plugged into a recent
dynamic fully-compressed suffix tree,
achieving full functionality for sequence analysis, while retaining the
reduced space and the polylogarithmic slowdown.
Our experimental results confirm the practicality of our proposal.