Storage and Retrieval of Individual Genomes
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko
Välimäki
A repetitive sequence collection is one where portions of a base
sequence of length n are repeated many times with small variations,
forming a collection of total length N. Examples of such collections are
version control data and genome sequences of individuals, where the differences
can be expressed by lists of basic edit operations. Flexible and efficient data
analysis on a such typically huge collection is plausible using suffix trees.
However, suffix tree occupies O(N log N) bits, which very soon inhibits
in-memory analyses. Recent advances in full-text self-indexing reduce
the space of suffix tree to O(N log s) bits, where s is the
alphabet size. In practice, the space reduction is more than 10-fold, for
example on suffix tree of Human Genome. However, this reduction factor remains
constant when more sequences are added to the collection.
We develop a new family of self-indexes suited for the repetitive sequence
collection setting. Their expected space requirement depends only on the length
n of the base sequence and the number S of variations in its
repeated copies. That is, the space reduction factor is no longer constant, but
depends on N/n.
We believe the structures developed in this work will provide a fundamental
basis for storage and retrieval of individual genomes as they become available
due to rapid progress in the sequencing technologies.