Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence
Collections
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko
Välimäki
A repetitive sequence collection is one where portions of a base
sequence of length n are repeated many times with small variations,
forming a collection of total length N. Examples of such collections
are version control data and genome sequences of individuals, where
the differences can be expressed by lists of basic edit operations.
This paper is devoted to studying ways to store massive sets of highly
repetitive sequence collections
in space-efficient manner so that retrieval of the content as well as queries
on the content of the sequences can be provided time-efficiently.
We show that the state-of-the-art entropy-bound full-text
self-indexes do not yet provide satisfactory space bounds for this
specific task. We engineer some new structures that use run-length encoding
and give empirical evidence that these structures are superior to the current
structures.