Distributed Generation of Suffix Arrays: a Quicksort-Based Approach

João Paulo Kitajima, Gonzalo Navarro, Berthier Ribeiro and Nivio Ziviani

An algorithm for the distributed computation of suffix arrays for large texts is presented. The parallelism model is that of a set of sequential tasks that execute in parallel and exchange messages with one another. The underlying architecture is a high-bandwidth network of processors. In such a network, a remote memory access has a transfer time similar to that of a magnetic disk (with no seek cost), which allows the aggregate memory distributed over the processors to be used as a giant cache for disks. Our algorithm takes advantage of this architectural feature to implement a quicksort-based distributed sorting procedure for building the suffix array. We show that this algorithm has computation complexity O(r log(n/r) + (n/r) log r log n) on average, and communication complexity O((n/r) log^2 r) in the worst case and O((n/r) log r) on average, where n is the text size and r is the number of processors. This is considerably faster than the best known sequential algorithm for building suffix arrays, which has disk time complexity O(n^2/m), where m is the size of the main memory. In the worst case, our algorithm is the best among the parallel algorithms we are aware of. Furthermore, it scales better in the worst case than the others.
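To make concrete what is being sorted, the following is a minimal sequential sketch, not the distributed algorithm itself. It builds a suffix array by applying quicksort to suffix start positions, comparing the suffixes they point to. The function name `suffix_array_quicksort` and the choice of the middle element as pivot are assumptions made for illustration only. The sketch performs no message passing and no partitioning of the text among processors.

```python
def suffix_array_quicksort(text):
    """Return the start positions of all suffixes of `text`,
    ordered lexicographically by the suffix each position points to.

    Illustrative, single-machine sketch: quicksort over suffix pointers.
    """
    def sort(positions):
        if len(positions) <= 1:
            return positions
        # Hypothetical pivot choice: the suffix at the middle pointer.
        pivot = text[positions[len(positions) // 2]:]
        smaller = [p for p in positions if text[p:] < pivot]
        equal = [p for p in positions if text[p:] == pivot]
        larger = [p for p in positions if text[p:] > pivot]
        return sort(smaller) + equal + sort(larger)

    return sort(list(range(len(text))))


if __name__ == "__main__":
    # Suffixes of "banana" in order: a, ana, anana, banana, na, nana
    print(suffix_array_quicksort("banana"))  # prints [5, 3, 1, 0, 4, 2]
```

Each comparison here materializes whole suffixes, which is acceptable only for small inputs. A practical implementation would compare the text in place through the pointers instead of copying suffixes.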