Practical Indexing of Repetitive Collections using Relative Lempel-Ziv
Gonzalo Navarro and Víctor Sepúlveda
We introduce a simple and implementable compressed index for highly repetitive
sequence collections based on Relative Lempel-Ziv (RLZ). On a collection of
total size n compressed into z phrases from a reference string
R[1..r] over alphabet [1..s] and with hth order empirical
entropy H_h(R), our index uses rH_h(R) + o(r log s) + O(r+z log n)
bits, and finds the occ
occurrences of a pattern P[1..m] in time O((m+occ) log n). This is
competitive with the only existing index based on RLZ, yet it is much simpler
and easier to implement. On a 1GB collection of 80 yeast genomes, a variant of
our index achieves the least space among competing structures (slightly over 0.1
bits per base) while outperforming or matching them in time (1-20 microseconds
per occurrence found). Our largest variant (below 0.4 bits per base) offers the
best search time (1-3 microseconds per occurrence) among all structures using
space below 1 bit per base.