Semi-Indexes and Compressed Indexes based on Text Sampling
Francisco Claude, Gonzalo Navarro, Hannu Peltola, Leena Salmela, and
Jorma Tarhio
We introduce a novel alphabet sampling technique for speeding up both online
and indexed string matching. We choose a subset of the alphabet and extract
the corresponding subsequence of the text. Online or indexed searching is then
carried out on the extracted subsequence, and candidate matches are verified
in the full text. We show that this speeds up online searching, especially for
moderate to long patterns, by a factor of up to 5, while using 14% extra space
in our experiments. For indexed searching we achieve indexes that are as fast
as the classical suffix array, yet occupy less than 50% extra space (instead
of
the usual 400%). Our experiments show no competitive alternatives exist in a
wide space/time range.