Practical Indexing of Repetitive Collections using Relative Lempel-Ziv

Gonzalo Navarro and Víctor Sepúlveda

We introduce a simple and implementable compressed index for highly repetitive sequence collections based on Relative Lempel-Ziv (RLZ). On a collection of total size n compressed into z phrases from a reference string R[1..r] over alphabet [1..s] and with hth order empirical entropy H_h(R), our index uses rH_h(R) + o(r log s) + O(r+z log n) bits, and finds the occ occurrences of a pattern P[1..m] in time O((m+occ) log n). This is competitive with the only existing index based on RLZ, yet it is much simpler and easier to implement. On a 1GB collection of 80 yeast genomes, a variant of our index achieves the least space among competing structures (slightly over 0.1 bits per base) while outperforming or matching them in time (1-20 microseconds per occurrence found). Our largest variant (below 0.4 bits per base) offers the best search time (1-3 microseconds per occurrence) among all structures using space below 1 bit per base.