Distributed Text Search using Suffix Arrays
Diego Arroyuelo, Carolina Bonacic, Veronica Gil-Costa, Mauricio Marin,
and Gonzalo Navarro
Text search is a classical problem in Computer Science, with many
data-intensive applications. For this problem, suffix arrays are among
the most widely known and used data structures, enabling fast searches for
phrases, terms, substrings and regular expressions in large texts. Potential
application domains for these operations include large-scale search services,
such as Web search engines, where high-traffic streams of on-line queries must
be processed efficiently. This paper proposes strategies
to enable such services by means of suffix arrays. We introduce techniques for
deploying suffix arrays on clusters of distributed-memory processors and then
study the processing of multiple queries on the distributed data structure.
Although individual search operations on sequential (non-distributed) suffix
arrays are cheap in practice, processing multiple queries on distributed-memory
systems so that hardware resources are used efficiently is a relevant problem
for services that aim for high query throughput at low operational cost.
Our theoretical and experimental
performance studies show that our proposals are suitable solutions for
building efficient and scalable on-line search services based on suffix
arrays.
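
For readers unfamiliar with the data structure, the following is a minimal sketch of substring search on a sequential (non-distributed) suffix array. It is an illustration only, not the distributed method proposed in the paper: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is chosen for brevity rather than efficiency.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of `text`,
    sorted in lexicographic order of the suffixes.

    Naive construction (sorts full suffixes); assumed here only
    to keep the sketch short. Practical builders are far faster.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted positions where `pattern` occurs in `text`.

    All suffixes that start with `pattern` form one contiguous range
    of the suffix array, so two binary searches locate the range.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ana")
```

Each query performs O(log n) string comparisons of at most m characters each, which is why individual searches are cheap in the sequential setting.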