###
Compressed Representations of Sequences and Full-Text Indexes

####
Paolo Ferragina, Giovani Manzini, Veli Mäkinen, and Gonzalo Navarro

Given a sequence *S = s(1)s(2) ... s(n)* of integers smaller than *r
= O(polylog(n))*, we show how *S* can be represented using
*nH0(S) + o(n)* bits, so that we can know any *s(i)*, as well
as answer *rank* and *select* queries on *S*, in constant
time. *H0(S)* is the zero-order empirical entropy of *S* and
*nH0(S)* provides an Information Theoretic lower bound to the
bit storage of any sequence *S* via a fixed encoding of its
symbols. This extends previous results on binary sequences, and
improves previous results on general sequences where those queries
are answered in *O(log r)* time. For larger *r*, we can still
represent *S* in *nH0(S) + o(n log r)* bits and answer queries
in *O(log r / log log n)* time.
Another contribution of this paper is to show how to combine our
compressed representation of integer sequences with an existing
compression boosting technique to design *compressed full-text
indexes* that scale well with the size of the input alphabet *A*.
Namely, we design a variant of the FM-index that indexes a string
*T[1,n]* within *nHk(T) + o(n)* bits of storage, where
*Hk(T)* is the *k*-th order empirical entropy of *T*. This
space bound holds simultaneously for all *k <= a log_|A| n*, constant
*0< a < 1*, and *|A| = O(polylog(n))*. This
index counts the occurrences of an arbitrary pattern *P[1,p]* as a
substring of *T* in *O(p)* time; it locates each pattern
occurrence in *O(log^(1+e) n)* time, for any constant
*0< e<1*; and it reports a text substring of length
*l* in *O(l + log^(1+e) n)* time.

Compared to all previous works, our index is the first one that
removes the alphabet-size dependance from all query times, in
particular counting time is linear in the pattern length. Still,
our index uses essentially the same space of the *k*-th order
entropy of the text *T*, which is the best space obtained in
previous work. We can also handle larger alphabets of size
*|A|=O(n^b)*, for any *0< b<1*, by paying
*o(n log |A|)* extra space and by multiplying all query times by
*O(log |A| / log log n)*.