Speeding Up Pattern Matching by Text Sampling
Francisco Claude, Gonzalo Navarro, Hannu Peltola, Leena Salmela, and
Jorma Tarhio
We introduce a novel alphabet sampling technique for speeding up both online
and indexed string matching. We choose a subset of the alphabet and select
the corresponding subsequence of the text. Online or indexed searching is then
carried out on that subsequence, and candidate matches are verified
in the full sequence. We show that this speeds up online searching, especially
for moderate to long patterns, by a factor of up to 5. For indexed searching
we achieve indexes that are as fast as the classical suffix array, yet occupy
less than 0.5 times the text size (instead of 4) plus text. Our experiments
show no competitive alternatives in a wide space/time range. The success and
simplicity of our approach owes in part to our analytical results, which solve
the challenging combinatorial problem of choosing the optimal alphabet subset.