# Parallel Construction of Succinct Trees

Leo Ferres, José Fuentes-Sepúlveda, Meng He, and Norbert Zeh

## Introduction

Trees are ubiquitous in Computer Science. They have applications in every aspect of computing from XML/HTML processing to abstract syntax trees (AST) in compilers, phylogenetic trees in computational genomics or shortest path trees in path planning. The ever increasing amounts of structured, hierarchical data processed in many applications have turned the processing of the corresponding large tree structures into a bottleneck, particularly when they do not fit in memory. Succinct tree representations store trees using as few bits as possible and thereby significantly increase the size of trees that fit in memory while still supporting important primitive operations in constant time. There exist such representations that use only $2⁣n+o\left(n\right)$ bits to store the topology of a tree with n nodes, which is close to the information-theoretic lower bound and much less than the space used by traditional pointer-based representations.

The construction of succinct trees is quite slow compared to the construction of pointer-based representations. Multicore parallelism offers one possible tool to speed up the construction of succinct trees, but little work has been done in this direction so far. The only results we are aware of focus on the construction of Wavelet trees, which are used in representations of text indexes.

In this work, we provide a parallel algorithm that constructs the Range Min-Max Tree representation of [1] in $O\left(\frac{n}{p}+\mathrm{lg}p\right)$ time using $p$ cores. This structure uses $2⁣n+o\left(n\right)$ bits to store an ordinal tree with n nodes and supports a rich set of basic operations on these trees in $\mathrm{lg}n$ time. While this query time is theoretically suboptimal, the Range Min-Max Tree structure is simple enough to be practical and has been verified experimentally to be very small and support fast queries in practice [2]. Combined with the fast parallel construction algorithm presented in this paper, it provides an excellent tool for manipulating very large trees in many applications. We implemented and tested our algorithm on a number of real-world input trees having billions of nodes. Our experiments show that our algorithm run on a single core is competitive with state-of-the-art sequential constructions of the Range Min-Max Tree structure and achieves good speed-up on up to $64$ cores and likely beyond.

## PDF document

An author's version of our work can be downloaded here. The link to the official version is here.

## The Code

The code for this work is available at github.

## Experiments

To evaluate the performance of our algorithm, we compare it against libcds and sdsl, which are state-of-the-art implementations of the Range Min-Max Tree. We implemented the our algorithm in C and compiled it using GCC 4.9 with optimization level -O2 and using the -ffast-math flag. All parallel code was compiled using the GCC Cilk branch. The same flags were used to compile libcds and sdsl, which were written in C++. We tested our algorithm on five inputs:
• wiki: The Wikipedia dump (January 12, 2015). This input has 498,753,914 parentheses.
• prot: Suffix tree of the protein data from the Pizza&Chili corpus. The suffix tree was constructed using this code. This input has 670,721,006 parentheses.
• dna: Suffix tree of the DNA data from the Pizza&Chili corpus. The suffix tree was constructed using this code. This input has 1,154,482,174 parentheses.
• ctree: Complete binary tree of depth 30. This input has 2,147,483,644 parentheses.
• osm: The OpenStreetMap dump (January 10, 2015). This input has 4,675,776,358 parentheses.
The experiments were carried out on a machine with four 16-core AMD Opteron® 6278 processors clocked at 2.4GHz, with 64KB of L1 cache per core, 2MB of L2 cache shared between two cores, and 6MB of L3 cache shared between 8 cores. The machine had 189GB of DDR3 RAM, clocked at 1333MHz. Running times were measured using the high-resolution (nanosecond) C functions in time.h. Memory usage was measured using the tools provided by malloc_count. In our experiments, the chunk size was fixed at 256.

### Results

p wiki prot dna ctree osm
libcds 33.16 44.24 75.87 140.41 339.21
sdsl 1.93 2.66 4.57 8.35 18.10
12.894.227.2112.1630.60
21.442.133.646.1515.43
40.731.101.873.187.98
80.370.570.981.594.14
120.310.440.721.153.07
160.250.350.580.862.21
200.230.320.550.761.83
240.210.280.460.691.72
280.180.280.450.631.55
320.180.250.390.631.33
360.160.250.390.551.33
400.180.260.350.511.27
440.170.270.350.491.14
480.220.280.380.531.21
520.240.260.330.461.09
560.210.280.390.491.02
600.280.290.390.461.04
640.270.290.390.481.01

• [1] Navarro, G., Sadakane, K.: Fully functional static and dynamic succinct trees. ACM Trans. Algorithms 10(3), 16:1–16:39 (May 2014)
• [2] Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: ALENEX. pp. 84–97. SIAM Press, Austin, Texas, USA (2010)

This work was supported the doctoral scholarship of CONICYT (21120974) and in part by the Emerging Leaders in the Americas scholarship programme