Parallel Construction of Wavelet Trees on Multicore Architectures

José Fuentes-Sepúlveda, Erick Elejalde, Leo Ferres, and Diego Seco


After their introduction in the mid-2000s, multicore computers — computers with more than one processing unit (called cores) and shared memory — have become pervasive. In fact, it is hard nowadays to find a single-processor desktop, let alone a high-end server. The argument for multicore systems is simple: thermodynamic and material considerations prevent chip manufacturers from further increasing clock frequencies. Since 2005, clock frequencies have stagnated at around 3.75GHz for commodity computers, and even in 2015, 4GHz computers are still high-end (the Intel Core i7-4790K is a prime example). Thus, one possible next step in performance is to take advantage of multicore computers. To do this, algorithms and data structures will have to be modified to make them behave well on parallel architectures.

Alongside the push for higher processing speed, much has been written in the past few years about succinct data structures. This is, in part, due to the fact that the data to be processed has become large enough that the ability to keep it close to the processor is vital. One such structure that has benefited from thorough research is the wavelet tree (wtree). Although the wtree was originally devised as a data structure for encoding a reordering of the elements of a sequence, it has been successfully used in many applications: for example, to index documents, grids, and even sets of rectangles.

A problem that frequently arises is that these succinct data structures are quite expensive (in time) to build, particularly as the size of the alphabet and the size of the input increase, as is the case now with the so-called “big data” revolution. Thus, sizeable contributions to the state of the art would involve algorithms with good theoretical running times that are also practical on modern commodity architectures with more than one core. Unfortunately, (practical) parallel computing suffers from several drawbacks that make such high-performance algorithms hard to come by: on the one hand, maintaining thread independence, and on the other, keeping clear of “thrashing” the memory hierarchy. If such algorithms are found, however, a positive side effect is that they would also help speed up processing of the target data structures in distributed systems: if one node is faster, the whole cluster is faster as well.

In this work we propose two parallel algorithms for the most expensive operation on wtrees, which is their construction. The first algorithm, pwt, has O(n) time complexity and uses O(n lg σ + σ lg n) bits of space, including the space of the final wtree and excluding the input. The second algorithm, dd, has O(lg n) time complexity and uses O(n lg σ + pσ lg n lg σ) bits of space, using p threads. The pwt algorithm improves the O(n) memory consumption of [1]. Meanwhile, the dd algorithm improves the O(n) time complexity of our previous work [2] and the time complexity of [1] by a factor of O(lg σ). Additionally, we report experiments which show the algorithms to be practical for large datasets, achieving good speedup.

PDF document

An author's version of our work can be downloaded here. The link to the official version is here.

The Code

The code for this work is available at github.


To evaluate the performance of our algorithms, we compared them against libcds and sdsl, which are state-of-the-art sequential implementations of wavelet trees. We also compared our algorithms against the parallel algorithm introduced in [1] (we thank Julian Shun for providing the code of [1]). We implemented our algorithms in C and compiled them using GCC 4.9 with optimization level -O2. All parallel code was compiled using the GCC Cilk branch. The same flags were used to compile libcds, sdsl and Shun's code, which were written in C++. We tested our algorithms on eighteen datasets of four types (rna, prot, src and en), shown in detail in the following tables:
Dataset     n               σ
rna.512MB   536,870,912     4
rna.1GB     1,073,741,824   4
rna.2GB     2,147,483,648   4
rna.3GB     3,221,225,472   4
rna.4GB     4,294,967,296   4
rna.5GB     5,368,709,120   4
rna.6GB     6,442,450,944   4
rna.13GB    14,570,010,837  4
prot        1,184,051,855   27

Dataset     n               σ
src.200MB   210,866,607     230
src.98MB    25,910,717      2,446,383
src.512MB   134,217,728     2,446,383
src.1GB     268,435,455     2,446,383
src.2GB     536,870,911     2,446,383
en.x.27     134,217,728     2^x
en.x.28     268,435,456     2^x
en.x.29     536,870,912     2^x
en.x.30     1,073,741,824   2^x

The experiments were carried out on a 4-chip (8 NUMA nodes) AMD Opteron 6278 machine with 8 physical cores per NUMA node, clocking at 2.4GHz each, with one 64KB L1 instruction cache shared by two cores, one 16KB L1 data cache per core, a 2MB L2 cache shared between two cores, and 6MB of L3 cache shared among the 8 cores of each NUMA node. The machine had 192GB of DDR3 RAM, clocking at 1333MHz, with 24GB per NUMA node. Running times were measured using the high-resolution (nanosecond) C functions in time.h. Memory usage was measured using the tools provided by malloc_count. We repeated each trial five times and recorded the median time.


This work was supported by a CONICYT doctoral scholarship (21120974) and in part by the Emerging Leaders in the Americas scholarship programme.

Contact us at: jfuentess@udec.cl
We would like to thank Free DNS and HighCharts for their services.