Efficient Construction of the BWT for Repetitive Text using String Compression
Diego Díaz-Domínguez and Gonzalo Navarro
We present a new semi-external linear-time algorithm for building the Burrows-Wheeler transform
(BWT) of a string collection. Our method uses compression techniques to reduce the computational
costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS)
and resort to run-length and grammar compression to maintain our intermediate results in compact
form. Our compression format not only saves space, but it also speeds up the required computations.
Our experiments show important savings in both space and computation time when the text is
repetitive. On average, we are 3.69x faster than the baseline approach, while maintaining a similar
memory consumption. These results make our method stand out as the only one (to our knowledge)
that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and
using only 27 GB of working memory.