# Parallel Construction of Wavelet Trees on Multicore Architectures

José Fuentes-Sepúlveda, Erick Elejalde, Leo Ferres, and Diego Seco

## Introduction

After their introduction in the mid-2000s, multicore computers (computers with more than one processing unit, called cores, and shared memory) have become pervasive. In fact, it is hard nowadays to find a single-processor desktop, let alone a high-end server. The argument for multicore systems is simple: thermodynamic and material considerations prevent chip manufacturers from further increasing clock frequencies. Since 2005, clock frequencies have stagnated at around 3.75GHz for commodity computers, and even in 2015, 4GHz computers are still high-end (the Intel Core i7-4790K is a prime example). Thus, one possible next step in performance is to take advantage of multicore computers. To do this, algorithms and data structures will have to be modified to make them behave well on parallel architectures.

Alongside these hardware trends, much has been written in the past few years about succinct data structures. This is, in part, because the data to be processed has become large enough that the ability to keep it close to the processor is vital. One such structure that has benefited from thorough research is the wavelet tree (wtree). Although the wtree was originally devised as a data structure for encoding a reordering of the elements of a sequence, it has been successfully used in many applications: for example, to index documents, grids and even sets of rectangles, to name a few.

A problem that frequently arises is that these succinct data structures are quite expensive (in time) to build, particularly as the size of the alphabet and the size of the input increase, as is the case now with the so-called “big data” revolution. Thus, sizeable contributions to the state of the art would involve algorithms with good theoretical running times that are also practical on modern commodity architectures with more than one core. Unfortunately, (practical) parallel computing suffers from several drawbacks that make such high-performance algorithms hard to come by: on the one hand, maintaining thread independence, and on the other, keeping clear of “thrashing” the memory hierarchy. If such algorithms are found, however, a positive side effect is that they would also help speed up processing of the target data structures in distributed systems: if one node is faster, the whole cluster will be faster as well.

In this work we propose two parallel algorithms for the most expensive operation on wtrees: their construction. The first algorithm, pwt, has $O(n)$ time complexity and uses $O(n \lg \sigma + \sigma \lg n)$ bits of space, including the space of the final wtree and excluding the input. The second algorithm, dd, has $O(\lg n)$ time complexity and uses $O(n \lg \sigma + \frac{p \sigma \lg n}{\lg \sigma})$ bits of space, using $p$ threads. The pwt algorithm improves on the $O(n)$ memory consumption of [1]. Meanwhile, the dd algorithm improves the $O(n)$ time complexity of our previous work [2] and the time complexity of [1] by a factor of $O(\lg \sigma)$. Additionally, we report experiments which show the algorithms to be practical for large datasets, achieving good speedup.

## PDF document

An author's version of our work can be downloaded here. The link to the official version is here.

## The Code

The code for this work is available on GitHub.

## Experiments

To evaluate the performance of our algorithms, we compared them against libcds and sdsl, which are state-of-the-art sequential implementations of wavelet trees. We also compared our algorithms against the parallel algorithm introduced in [1] (we thank Julian Shun for providing the code of [1]). We implemented our algorithms in C and compiled them using GCC 4.9 with optimization level -O2. All parallel code was compiled using the GCC Cilk branch. The same flags were used to compile libcds, sdsl and Shun's code, which are written in C++. We tested our algorithms on four datasets:
• rna: This dataset corresponds to the GenBank mRNAs of the University of California, Santa Cruz (link). We encoded each symbol of the dataset using 1 byte. Additionally, we split the original dataset to generate the datasets rna.13GB, rna.6GB, rna.5GB, rna.4GB, rna.3GB, rna.2GB, rna.1GB and rna.512MB.
• prot: Protein sequences from the Pizza&Chili corpus. This dataset was encoded using 1 byte per symbol.
• src: Source code from the Pizza&Chili corpus. We created two versions of this dataset. The first version used 1 byte to encode each symbol, generating the dataset src.200MB. The second version used words as symbols, using 4 bytes per symbol. We concatenated this second version, generating the datasets src.98MB, src.512MB, src.1GB and src.2GB.
• en: To measure the impact of varying the alphabet size, we took the English corpus of the Pizza&Chili corpus as a sequence of words and filtered the number of different symbols in the dataset. The dataset had an initial alphabet $\Sigma$ of $\sigma = 633{,}816$ symbols. For experimentation, we generated an alphabet $\Sigma'$ of size $2^x$, with $x \in \{4,6,8,10,12,14\}$, taking the top $2^x$ most frequent words in the original $\Sigma$ and then assigning a random index to each symbol using a Mersenne Twister. To create an input sequence $S$ of $n$ symbols for the English dataset, we searched for each symbol of $\Sigma'$ in the original English text and, when found, appended it to $S$ until it reached the maximum possible size given $\sigma$ (about 1.5GB in the case of $\sigma = 2^{18}$), maintaining the order of the original English text. We then either truncated $S$ to reach the target size $n = 2^{27}$ or concatenated $S$ with initial sub-sequences of itself to reach the larger sizes $2^{28}$, $2^{29}$ and $2^{30}$. This generated the datasets en.x.27, en.x.28, en.x.29 and en.x.30.
The datasets are shown in detail in the next table:
| Dataset | n | σ |
| --- | --- | --- |
| rna.512MB | 536,870,912 | 4 |
| rna.1GB | 1,073,741,824 | 4 |
| rna.2GB | 2,147,483,648 | 4 |
| rna.3GB | 3,221,225,472 | 4 |
| rna.4GB | 4,294,967,296 | 4 |
| rna.5GB | 5,368,709,120 | 4 |
| rna.6GB | 6,442,450,944 | 4 |
| rna.13GB | 14,570,010,837 | 4 |
| prot | 1,184,051,855 | 27 |
| src.200MB | 210,866,607 | 230 |
| src.98MB | 25,910,717 | 2,446,383 |
| src.512MB | 134,217,728 | 2,446,383 |
| src.1GB | 268,435,455 | 2,446,383 |
| src.2GB | 536,870,911 | 2,446,383 |
| en.x.27 | 134,217,728 | $2^x$ |
| en.x.28 | 268,435,456 | $2^x$ |
| en.x.29 | 536,870,912 | $2^x$ |
| en.x.30 | 1,073,741,824 | $2^x$ |

The experiments were carried out on a 4-chip (8 NUMA nodes) AMD Opteron 6278 machine with 8 physical cores per NUMA node, clocking at 2.4GHz each, with one 64KB L1 instruction cache shared by two cores, one 16KB L1 data cache per core, a 2MB L2 cache shared between two cores, and 6MB of L3 cache shared among the 8 cores of each NUMA node. The machine had 192GB of DDR3 RAM, clocking at 1333MHz, with 24GB per NUMA node. Running times were measured using the high-resolution (nanosecond) C functions in time.h. Memory usage was measured using the tools provided by malloc_count. We repeated each trial five times and recorded the median time.

### Results

• Running times, in seconds, of the sequential algorithms and the parallel algorithms with 1 and 64 threads. The best sequential times are underlined and the best parallel times are shown in bold. A “-” is shown for implementations that only work for $n < 2^{32}$.

| Dataset | libcds | sdsl | pwt (1) | pwt (64) | dd (1) | dd (64) | shun (1) | shun (64) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rna.512MB | 23.42 | 32.41 | 11.83 | 7.00 | 12.65 | 0.40 | 12.63 | 0.67 |
| rna.1GB | 47.38 | 65.30 | 23.89 | 16.19 | 25.30 | 0.62 | 25.36 | 1.32 |
| rna.2GB | 100.13 | 131.86 | 46.98 | 27.62 | 50.80 | 1.20 | 50.89 | 2.64 |
| rna.3GB | 142.90 | 220.11 | 71.09 | 41.00 | 75.37 | 2.17 | 66.35 | 3.79 |
| rna.4GB | - | 198.10 | 94.39 | 55.04 | 101.44 | 2.84 | - | - |
| rna.5GB | - | 329.27 | 117.13 | 68.24 | 126.66 | 3.57 | - | - |
| rna.6GB | - | 389.25 | 141.59 | 81.80 | 152.57 | 4.35 | - | - |
| rna.13GB | - | 881.41 | 314.86 | 330.44 | 333.14 | 10.75 | - | - |
| prot | 104.40 | 142.67 | 58.54 | 21.81 | 68.19 | 2.17 | 64.06 | 3.54 |
| src.200MB | 24.81 | 31.41 | 14.68 | 2.67 | 17.70 | 0.52 | 16.73 | 1.06 |
| src.98MB | 7.92 | 9.52 | 5.28 | 0.77 | 5.73 | 3.94 | 5.07 | 0.75 |
| src.512MB | 37.77 | 49.21 | 28.94 | 5.07 | 28.98 | 5.36 | 25.52 | 3.07 |
| src.1GB | 75.48 | 99.95 | 57.99 | 8.87 | 55.36 | 9.60 | 49.52 | 6.17 |
| src.2GB | 150.67 | 205.41 | 112.78 | 25.30 | 110.83 | 15.11 | 98.11 | 11.77 |
| en.4.27 | 8.78 | 14.24.82 | 5.75 | 1.82 | 6.50 | 0.28 | 6.98 | 0.38 |
| en.4.28 | 15.82 | 28.53 | 11.44 | 3.67 | 12.88 | 0.40 | 12.34 | 0.77 |
| en.4.29 | 35.43 | 57.11 | 23.01 | 7.22 | 25.51 | 0.84 | 24.68 | 1.57 |
| en.4.30 | 70.00 | 113.88 | 46.10 | 14.40 | 51.06 | 1.63 | 55.56 | 3.06 |
| en.6.27 | 12.44 | 19.10 | 7.98 | 1.78 | 9.58 | 0.36 | 10.46 | 0.61 |
| en.6.28 | 22.65 | 38.37 | 15.92 | 3.33 | 19.35 | 0.52 | 18.38 | 1.17 |
| en.6.29 | 50.28 | 76.91 | 31.78 | 7.08 | 37.90 | 1.18 | 41.86 | 2.36 |
| en.6.30 | 99.66 | 153.72 | 63.62 | 15.90 | 76.59 | 2.20 | 83.29 | 4.68 |
| en.8.27 | 15.87 | 26.00 | 11.48 | 1.87 | 13.15 | 0.46 | 14.10 | 0.88 |
| en.8.28 | 29.06 | 52.15 | 22.86 | 3.71 | 26.52 | 0.78 | 28.28 | 1.58 |
| en.8.29 | 64.84 | 105.01 | 45.79 | 7.57 | 52.53 | 1.56 | 56.68 | 3.14 |
| en.8.30 | 128.65 | 209.54 | 91.83 | 14.65 | 105.00 | 3.13 | 113.13 | 6.26 |
| en.10.27 | 21.32 | 33.25 | 14.61 | 2.26 | 13.94 | 1.66 | 17.26 | 1.39 |
| en.10.28 | 43.55 | 68.00 | 30.32 | 6.43 | 29.05 | 2.18 | 33.15 | 2.78 |
| en.10.29 | 89.96 | 136.67 | 60.69 | 9.25 | 58.55 | 4.59 | 67.16 | 5.67 |
| en.10.30 | 183.57 | 281.53 | 123.88 | 17.70 | 119.14 | 8.93 | 214.2 | 10.77 |
| en.12.27 | 24.38 | 39.09 | 17.97 | 2.52 | 17.33 | 2.61 | 20.33 | 1.64 |
| en.12.28 | 50.17 | 80.22 | 37.66 | 7.62 | 36.36 | 2.66 | 38.97 | 3.25 |
| en.12.29 | 103.39 | 161.96 | 75.09 | 10.41 | 72.46 | 5.73 | 128.35 | 6.71 |
| en.12.30 | 211.66 | 333.32 | 150.02 | 20.33 | 145.04 | 9.66 | 259.21 | 12.99 |
| en.14.27 | 27.44 | 43.61 | 21.92 | 3.10 | 21.39 | 2.43 | 22.51 | 1.84 |
| en.14.28 | 56.44 | 90.05 | 45.85 | 6.11 | 44.70 | 2.94 | 44.53 | 3.67 |
| en.14.29 | 116.15 | 182.46 | 90.41 | 12.50 | 88.37 | 6.97 | 91.53 | 7.79 |
| en.14.30 | 238.36 | 377.77 | 184.83 | 22.31 | 178.58 | 10.50 | 302.14 | 15.98 |

• Speedup with respect to the best sequential time. The caption of each figure indicates the name of the dataset, the input size $n$ and the alphabet size $\sigma$.

• Memory consumption.

• [1] Shun, J.: Parallel wavelet tree construction. In: Proceedings of the IEEE Data Compression Conference. Utah, USA (April 2015)
• [2] Fuentes-Sepúlveda, J., Elejalde, E., Ferres, L., Seco, D.: Efficient Wavelet Tree Construction and Querying for Multicore Architectures. In: Experimental Algorithms. pp. 150–161. (2014)

This work was supported by a CONICYT doctoral scholarship (21120974) and in part by the Emerging Leaders in the Americas scholarship programme.