On Compressing and Indexing Repetitive Sequences
Sebastian Kreft and Gonzalo Navarro
We introduce LZ-End, a new member of the Lempel-Ziv family of text
compressors, which achieves compression ratios close to those of LZ77 but
performs much faster at extracting arbitrary text substrings. We then
build the first self-index based on LZ77 (or LZ-End) compression, which in
addition to text extraction offers fast indexed searches on the compressed
text. This self-index is particularly effective to represent highly
repetitive sequence collections, which arise for example when storing
versioned documents, software repositories, periodic publications, and
biological sequence databases.