A Compact RDF Store using Suffix Arrays
Nieves Brisaboa, Ana Cerdeira, Antonio Fariña, and Gonzalo Navarro
RDF has become a standard format to describe resources in the Semantic Web and
other scenarios. RDF data is composed of triples (subject, predicate, object),
referring respectively to a resource, a property of that resource, and the
value of that property. Compact storage schemes allow fitting larger datasets
in main memory for faster processing. On the other hand, supporting efficient
SPARQL queries on RDF datasets requires index data structures to accompany the
data, which hampers compactness. As has been done for text collections, we introduce a
self-index for RDF data, which combines the data and its index in a single
representation that takes less space than the raw triples and efficiently
supports basic SPARQL queries. Our storage format, RDFCSA, builds on
compressed suffix arrays. Although more compact representations of RDF data
exist, RDFCSA uses about half the space of the raw data (which it replaces)
and displays much more robust and predictable query times, around 1–2
microseconds per retrieved triple. RDFCSA is 3 orders of magnitude faster than
representations like MonetDB or RDF-3X, while using the same space as the
former and 6 times less space than the latter. It is also faster than the more
compact representations on most queries, in some cases by 2 orders of
magnitude.
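
To make the idea of answering triple patterns from a sorted index concrete, the following Python sketch may help. It is only an illustration under simplifying assumptions, not the RDFCSA structure: in place of a compressed suffix array it keeps a plain sorted list of the three cyclic rotations of each dictionary-encoded triple, so it shows the lookup mechanism but none of the space savings. The names `build_index` and `match`, the single shared term dictionary, and the toy triples are hypothetical choices made for this example.

```python
from bisect import bisect_left


def build_index(triples):
    """Dictionary-encode the terms and sort the cyclic rotations of each triple."""
    # Assumption: one shared dictionary maps every term to an integer id.
    terms = sorted({t for triple in triples for t in triple})
    ids = {t: i for i, t in enumerate(terms)}
    rotations = []
    for s, p, o in set(triples):
        a, b, c = ids[s], ids[p], ids[o]
        rotations.append((0, a, b, c))  # rotation starting at the subject
        rotations.append((1, b, c, a))  # rotation starting at the predicate
        rotations.append((2, c, a, b))  # rotation starting at the object
    rotations.sort()  # stands in for the suffix array order
    return terms, ids, rotations


def match(index, s=None, p=None, o=None):
    """Return all triples matching the pattern; None marks an unbound component."""
    terms, ids, rotations = index
    pattern = (s, p, o)
    if any(t is not None and t not in ids for t in pattern):
        return []
    bound = [t is not None for t in pattern]
    # Start at a bound component whose cyclic predecessor is unbound, so that
    # the bound components form a prefix of the chosen rotation.
    start = 0
    for r in range(3):
        if bound[r] and not bound[(r - 1) % 3]:
            start = r
            break
    key = [start]
    for k in range(3):
        t = pattern[(start + k) % 3]
        if t is None:
            break
        key.append(ids[t])
    # Two binary searches delimit the contiguous range sharing the prefix.
    lo = bisect_left(rotations, tuple(key))
    hi = bisect_left(rotations, tuple(key[:-1] + [key[-1] + 1]))
    results = []
    for rot in rotations[lo:hi]:
        r, comps = rot[0], rot[1:]
        triple = [None, None, None]
        for k in range(3):
            triple[(r + k) % 3] = terms[comps[k]]  # undo the rotation
        results.append(tuple(triple))
    return results


# Toy data, invented for illustration only.
toy = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]
toy_index = build_index(toy)
print(match(toy_index, p="knows", o="carol"))
print(match(toy_index, s="alice", o="bob"))
```

Because the bound components of any triple pattern are cyclically contiguous, one of the three rotations always begins with them, so every pattern reduces to a prefix range found by binary search. This sketch stores three uncompressed copies of the data, which is precisely the overhead a self-index is meant to avoid.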