Compressed q-gram Indexing for Highly Repetitive Biological Sequences
Francisco Claude, Antonio Fariña, Miguel Martínez-Prieto,
and Gonzalo Navarro
The study of compressed storage schemes for highly repetitive sequence
collections has been recently boosted by the availability of cheaper
sequencing technologies and the flood of data they promise to generate.
Such a storage scheme may range from the simple goal of retrieving whole
individual sequences to the more advanced one of providing fast searches in
the collection. In this paper we study alternatives to implement a
particularly
popular index, namely, the one able to find all the positions in the
collection of substrings of fixed length (q-grams). We introduce two
novel techniques and show they constitute practical alternatives to handle
this scenario. They excell particularly in two cases: when q is small (up
to 6), and when the collection is extremely repetitive (less than
0.01% mutations).