Advantages of Backward Searching -- Efficient Secondary Memory and
Distributed Implementation of Compressed Suffix Arrays
Veli Mäkinen, Gonzalo Navarro and Kunihiko Sadakane
One of the most relevant succinct suffix array proposals in the literature is
the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs
n(H_0+O(log log s)) bits of space, where n is the text size,
s is the alphabet size, and H_0 the zero-order entropy of the text.
The number of occurrences of a pattern of length m can be computed in
O(m log n) time. Most notably, the CSA does not need the text separately
available to operate. The CSA simulates a binary search over the suffix array,
where the query is compared against text substrings. These are extracted from
the same CSA by following irregular access patterns over the structure.
Sadakane [SODA 2002] has proposed using backward searching on the CSA
in similar fashion as the FM-index of Ferragina and Manzini [FOCS 2000]. He has
shown that the CSA can be searched in O(m) time whenever
s = O(polylog(n)).
In this paper we consider some other consequences of backward searching applied
to CSA. The most remarkable one is that we do not need, unlike all previous
proposals, any complicated sub-linear structures based on the four-Russians
technique (such as constant time rank and select queries on bit arrays).
We show that sampling and compression are enough to achieve O(m log n) query
time using less space than the original structure. It is also possible to trade
structure space for search time. Furthermore, the regular access pattern
of backward searching permits an efficient secondary memory implementation, so
that the search can be done with O(m log_B n) disk accesses, being B the
disk block size. Finally, it permits a distributed implementation with optimal
speedup and negligible communication effort.