Practical Compressed String Dictionaries
Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas,
Francisco Claude, and Gonzalo Navarro
The need to store and query a set of strings -- a string dictionary --
arises in many
kinds of applications. While classically these string dictionaries have
accounted for a small
share of the total space budget (e.g., in Natural Language Processing or when
indexing text
collections), recent applications in Web engines, Semantic Web (RDF) graphs,
Bioinformatics, and
many others, handle very large string dictionaries, whose size is a
significant fraction of the
whole data. In these cases, string dictionary management is a scalability
issue by itself.
This paper focuses on the problem of managing large static string dictionaries
in compressed
main memory space.
We revisit classical solutions for string dictionaries like hashing, tries,
and front-coding, and improve them by using compression techniques.
We also introduce some novel string dictionary representations built on top
of
recent advances in succinct data structures and full-text indexes.
All these structures are empirically compared on a heterogeneous testbed
formed by
real-world string dictionaries. We show that the compressed representations
may use as little as 5\% of the original dictionary size, while supporting
lookup
operations within a few microseconds. These numbers outperform the
state-of-the-art space/time
tradeoffs in many cases. Furthermore, we enhance some
representations to provide prefix- and substring-based searches, which also
perform
competitively. The results show that compressed string dictionaries are a
useful building
block for various data-intensive applications in different domains.