###
Simulating the DNA Overlap Graph in Succinct Space

####
Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro

Converting a set of sequencing reads into a lossless compact data structure
that encodes all the relevant biological information is a major challenge. The
classical approaches are to build the string graph or the *de Bruijn*
graph. Each has advantages over the other depending on the application. Still,
the ideal setting would be to have an index of the reads that is easy to build
and can be adapted to any type of biological analysis. In this paper we
propose a new data structure we call *rBOSS*, which gets close to that
ideal. Our *rBOSS* is a de Bruijn graph in practice, but it simulates
any length up to *k* and can compute overlaps of size at least
*m* between the
labels of the nodes, with *k* and *m* being parameters. If we choose the
parameter *k* equal to the size of the reads, then we can simulate a complete
string graph. As most BWT-based structures, *rBOSS* is unidirectional,
but it exploits the property of the DNA reverse complements to simulate
bi-directionality with some time-space trade-offs. We implemented a genome
assembler on top of *rBOSS* to demonstrate its usefulness. Our
experimental results show that, using *k*=100, *rBOSS* can assemble 185
MB of reads in less than 15 minutes and using 110 MB in total. It produces
contigs of mean sizes over 10,000, which is twice the size obtained by using a
pure de Bruijn graph of fixed length *k*.