The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4 n Hk(T) + o(n log s) bits of space, where n is the length of the text T, s is the alphabet size, and Hk(T) is the k-th order empirical entropy of T). In practice, the average time to locate a pattern P of length m with the LZ-index is O(s m log_s n + occ s^(m/2)), where occ is the number of occurrences of P in T. It can extract text substrings of length l in O(l) time. This index outperforms competing schemes both at locating short patterns and at extracting text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use n Hk(T) + o(n log s) bits in theory), and it does not offer space/time tuning options. This limits its applicability.
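As background for the LZ78 parsing on which the index is based, the following is a minimal Python sketch of a textbook LZ78 parse. It is not the LZ-index itself, and the function names `lz78_parse` and `lz78_decode` are illustrative choices, not names from the paper. Each phrase is the longest previously seen phrase extended by one new character.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases, returned as (parent_phrase_id, char) pairs.

    Phrase ids start at 1; id 0 denotes the empty phrase (the trie root).
    """
    trie = {}      # maps (parent phrase id, char) -> phrase id
    phrases = []   # phrases[i] describes the phrase with id i + 1
    node = 0       # current position in the trie, starting at the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current match
        else:
            phrases.append((node, ch))        # new phrase = matched phrase + ch
            trie[(node, ch)] = len(phrases)
            node = 0                          # restart matching from the root
    if node != 0:
        # The text ended inside an already known phrase; emit it with no new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (parent_phrase_id, char) pairs."""
    expanded = [""]                           # expanded[0] is the empty phrase
    for parent, ch in phrases:
        expanded.append(expanded[parent] + ch)
    return "".join(expanded[1:])


if __name__ == "__main__":
    parsed = lz78_parse("abababab")
    print(parsed)
    print(lz78_decode(parsed))
```

The number of phrases produced by this parse is the quantity that governs the size of LZ78-based structures: more repetitive texts yield fewer, longer phrases.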
In this paper we study practical ways to reduce the space of the LZ-index.
We obtain new LZ-index variants that require
2(1+e) n Hk(T) + o(n log s) bits of space, for any 0
We perform extensive experimentation and conclude that our schemes are able
to reduce the space of the original LZ-index to about 2/3 of its size, that is,
to around 3 times the compressed text size.
Our schemes are able to extract about 1-2 megabytes of the text
per second, which is twice as fast as the most competitive alternatives.
Pattern occurrences are located at rates of up to 1 to 4 million per second.
This constitutes the best space/time trade-off
when indices are allowed to use 4 times the size of the compressed text or
more.
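
The space figures quoted above can be checked with a short calculation. The Python sketch below compares only the leading terms of the stated bounds and ignores the o(n log s) terms. The sample values of e are purely illustrative assumptions, since the admissible range of e is not fully stated in the text above, and the names used are hypothetical.

```python
def new_variant_factor(e):
    """Coefficient of n Hk(T) in the new variants' bound, 2(1+e) n Hk(T)."""
    return 2 * (1 + e)


ORIGINAL_THEORY_FACTOR = 4   # original LZ-index: 4 n Hk(T)
SMALLEST_THEORY_FACTOR = 1   # smallest existing indices: n Hk(T)

# Practical figures from the text: about 5x the compressed text, cut to about 2/3.
ORIGINAL_PRACTICAL_SIZE = 5.0
reduced_practical_size = ORIGINAL_PRACTICAL_SIZE * 2 / 3

if __name__ == "__main__":
    # Illustrative values of e only; the text does not state which were used.
    for e in (0.25, 0.5, 0.75):
        factor = new_variant_factor(e)
        ratio = factor / ORIGINAL_THEORY_FACTOR
        print(f"e={e}: {factor} n Hk(T), {ratio:.3f} of the original leading term")
    print(f"practical size: about {reduced_practical_size:.2f}x the compressed text")
```

Under these assumptions, 5 x 2/3 is roughly 3.3, which is consistent with the reported figure of around 3 times the compressed text size.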