A Compact RDF Store using Suffix Arrays

Nieves Brisaboa, Ana Cerdeira, Antonio Fariña, and Gonzalo Navarro

RDF has become a standard format for describing resources in the Semantic Web and other scenarios. RDF data consists of triples (subject, predicate, object), which refer respectively to a resource, a property of that resource, and the value of that property. Compact storage schemes allow larger datasets to fit in main memory for faster processing. Supporting efficient SPARQL queries on RDF datasets, however, requires index data structures to accompany the data, which hampers compactness. As has been done for text collections, we introduce a self-index for RDF data: a single representation that combines the data and its index, takes less space than the raw triples, and efficiently supports basic SPARQL queries. Our storage format, RDFCSA, builds on compressed suffix arrays. Although more compact representations of RDF data exist, RDFCSA uses about half the space of the raw data (which it replaces) and displays much more robust and predictable query times, around 1–2 microseconds per retrieved triple. RDFCSA is 3 orders of magnitude faster than representations like MonetDB or RDF-3X, while using the same space as the former and a sixth of the space of the latter. It is also faster than the more compact representations on most queries, in some cases by 2 orders of magnitude.
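To make the self-index idea concrete, the following is a minimal toy sketch, not the RDFCSA structure itself. It assumes a single shared dictionary of integer IDs, a plain (uncompressed) suffix array built naively over the concatenated ID sequence, and a pattern of the form (subject, predicate, ?object). All function names are hypothetical and chosen for illustration only.

```python
def build_index(triples):
    """Dictionary-encode the triples and index the concatenated ID sequence."""
    # Assumption of this sketch: one shared dictionary for all three roles.
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms)}
    encoded = sorted((ids[s], ids[p], ids[o]) for s, p, o in triples)
    # Flatten the sorted triples into one sequence: s1 p1 o1 s2 p2 o2 ...
    seq = [x for triple in encoded for x in triple]
    # Naive suffix array construction; adequate only for a toy example.
    sa = sorted(range(len(seq)), key=lambda i: seq[i:])
    return ids, terms, seq, sa


def find_range(seq, sa, pattern):
    """Return the [start, end) range of sa whose suffixes begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def query_sp(index, subject, predicate):
    """Answer the triple pattern (subject, predicate, ?o)."""
    ids, terms, seq, sa = index
    if subject not in ids or predicate not in ids:
        return []
    start, end = find_range(seq, sa, [ids[subject], ids[predicate]])
    # Keep only matches aligned with the start of a triple.
    hits = [p for p in sa[start:end] if p % 3 == 0]
    return sorted(tuple(terms[x] for x in seq[p:p + 3]) for p in hits)


example_triples = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "alice"),
    ("alice", "likes", "rdf"),
]
example_index = build_index(example_triples)
print(query_sp(example_index, "alice", "knows"))
```

In this sketch the sorted ID sequence plays the role of the data and the suffix array plays the role of the index; a self-index in the sense described above would instead replace both with a single compressed structure.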