Distributed Generation of Suffix Arrays: a Quicksort-Based Approach
João Paulo Kitajima, Gonzalo Navarro, Berthier Ribeiro and Nivio Ziviani
An algorithm for the distributed computation of suffix arrays for large
texts is presented. The parallelism model is a set of sequential
tasks that execute in parallel and exchange messages with one
another. The underlying architecture is that of a
high-bandwidth network of processors. In such a network, a remote
memory access has a transfer time similar to that of a magnetic disk
(with no seek cost), which allows the aggregate memory distributed
over the various processors to be used as a giant cache for
disks. Our algorithm takes advantage of this architectural feature to
implement a quicksort-based distributed sorting procedure for building
the suffix array. We show that this algorithm has computation
complexity O(r log(n/r) + (n/r) log r log n) on average and
communication complexity O((n/r) log^2 r) in the worst case and
O((n/r) log r) on average, where n is the text size and r is the
number of processors.
This is considerably faster than the best known sequential algorithm
for building suffix arrays, which has disk time complexity O(n^2/m),
where m is the size of the main memory. In the worst case, our
algorithm is the best among the parallel algorithms we are aware of.
Furthermore, it scales better in the worst case than the others.
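
As a rough illustration of the idea, the sketch below builds a suffix
array by sorting suffix pointers, first directly and then with a
quicksort-style recursion that halves a group of r simulated
processors in each round. It is a single-process toy and not the
algorithm analyzed in the paper: the function names, the sampled
median pivot, and the way r is split are assumptions made for this
example, and no messages are actually exchanged.

```python
def suffix_array(text):
    """Reference construction: sort all suffix start positions
    by the suffix that begins there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def quicksort_rounds(text, pointers, r):
    """Toy simulation of a quicksort-style split over r 'processors'.

    Assumption for illustration only: each round picks a pivot suffix
    from a small sample, sends the smaller suffixes to one half of the
    processor group and the rest to the other half, and recurses until
    a single processor sorts its share locally.
    """
    if r <= 1 or len(pointers) <= 1:
        # One processor left: plain local sort of its suffix pointers.
        return sorted(pointers, key=lambda i: text[i:])

    # Pivot: median suffix of a small, evenly spaced sample.
    step = max(1, len(pointers) // 8)
    sample = sorted(pointers[::step], key=lambda i: text[i:])
    pivot = text[sample[len(sample) // 2]:]

    # Partition the pointers around the pivot suffix.
    low = [i for i in pointers if text[i:] < pivot]
    high = [i for i in pointers if text[i:] >= pivot]

    # Each half of the processor group handles one side.
    left = quicksort_rounds(text, low, r // 2)
    right = quicksort_rounds(text, high, r - r // 2)
    return left + right


if __name__ == "__main__":
    text = "banana"
    print(suffix_array(text))
    print(quicksort_rounds(text, list(range(len(text))), 4))
```

Both calls return the same ordering of suffix positions. In a real
distributed setting, each partition step would correspond to a round
of message exchange between processors, and the number of such rounds
grows with log r.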