Stronger Lempel-Ziv Based Compressed Text Indexing

Diego Arroyuelo, Gonzalo Navarro, and Kunihiko Sadakane

Given a text T[1..u] over an alphabet of size s, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space.

The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4 u Hk(T) + o(u log s) bits of space, where Hk(T) denotes the k-th order empirical entropy of T, for any k = o(log_s u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m^3 log s + (m+occ) log u) worst-case time. Although this index has proven very competitive in practice, the O(m^3 log s) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives.

In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+e) u Hk(T) + o(u log s) bits of space, for any constant e > 0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m^2 + (m+occ) log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length l in optimal O(l/log_s u) time. In addition, we show how the space can be squeezed to (1+e) u Hk(T) + o(u log s) to obtain a structure with O(m^2) average search time for m >= 2 log_s u. Alternatively, the search time of LZ-indices can be improved to O((m+occ) log u) with (3+e) u Hk(T) + o(u log s) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.