Reifying RDF: What Works Well With Wikidata?

This document aims to make the experiments described in the paper Reifying RDF: What Works Well With Wikidata? (by Daniel Hernández, Aidan Hogan and Markus Krötzsch) replicable. The objective of the experiments was to empirically compare the effect of different models for representing qualified statements in Wikidata.

This document is a work in progress. If you have a question or suggestion about the experiments, please write to Daniel Hernández (daniel@dcc.uchile.cl).

You can download the article, the results, the data in each schema, and the queries. The original data was taken from the Wikidata RDF dump of 2015-02-23.

Hardware

The experiments were performed on a machine with an Intel Xeon E5-2407 v2 CPU, 32GB of memory (4×8GB registered DDR3 SDRAM at 1333 MHz) and 1TB of disk space (a RAID of two Seagate ST1000NM0033-9ZM disks, SATA3, 128MB cache, 7,200 RPM).

Database engine configuration

Virtuoso

Virtuoso stores its data and configuration in the directory:

/usr/local/virtuoso-opensource/var/lib/virtuoso/db/

Instead, we store the database of each model $MODEL in the directory:

/var/wikidata/db/virtuoso/$MODEL/

where $MODEL is one of sr, np, sp or ng. Thus, before starting the Virtuoso server for each model we create a symbolic link:

cd /usr/local/virtuoso-opensource/var/lib/virtuoso/
ln -s /var/wikidata/db/virtuoso/$MODEL/ db
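
When switching from one model to another, the previous link must be removed before the new one is created. A minimal sketch (assuming the server for the previous model has already been stopped):

cd /usr/local/virtuoso-opensource/var/lib/virtuoso/
rm -f db                                # remove the link to the previous model's database
ln -s /var/wikidata/db/virtuoso/sp/ db  # e.g., switch to the sp model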

Inside each database directory the configuration is stored in the file virtuoso.ini. In this file we set the following variables:

NumberOfBuffers            = 2720000
MaxDirtyBuffers            = 2000000
MaxQueryCostEstimationTime = 0
MaxQueryExecutionTime      = 60

The values for NumberOfBuffers and MaxDirtyBuffers are those recommended for a machine with 32GB of memory (as our machine has).

The property MaxQueryCostEstimationTime indicates the maximum estimated time for a query: if a query is estimated to take longer than this value, it is not executed at all. When this property is set to 0, no limit is placed on the estimated execution time. MaxQueryExecutionTime indicates how long a query may run; when a query reaches this time it is aborted. Thus, we set MaxQueryCostEstimationTime to 0 and MaxQueryExecutionTime to 60 seconds to ensure a real timeout of one minute.

In Virtuoso the default dataset is always assumed to be the union of all named graphs, so no specific configuration is needed to get this behaviour with the named graphs model.
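
With the symbolic link in place and virtuoso.ini adjusted, the server can be started from the database directory. A minimal sketch, assuming the virtuoso-t binary of the Virtuoso Open Source distribution is on the PATH:

cd /usr/local/virtuoso-opensource/var/lib/virtuoso/db/
virtuoso-t +configfile virtuoso.ini +wait   # start the server using the model's virtuoso.ini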

Fuseki

We set the heap size to 20GB in an environment variable to give Fuseki enough memory before starting the server:

export JVM_ARGS=-Xmx20G

For each $MODEL in sr, np and sp we start the Fuseki server with the following command from within the Fuseki distribution directory:

cd /var/wikidata/engines/fuseki
./fuseki-server --timeout=60000 --loc /var/wikidata/db/fuseki/$MODEL/ /DS

The parameter --timeout=60000 sets the execution timeout to one minute. Fuseki returns an error if a query has sent no results before the timeout, or truncates the results if the timeout occurs while they are being sent.
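
Once the server is running, queries can be sent to the SPARQL endpoint of the dataset /DS. A sketch using curl, assuming Fuseki's default port 3030 and a placeholder query:

curl -s http://localhost:3030/DS/query \
     --data-urlencode 'query=SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'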

The named graphs model (ng) requires setting the default graph as the union of all the named graphs in the dataset. This is done by adding the parameter --set tdb:unionDefaultGraph=true to the command that starts the server:

cd /var/wikidata/engines/fuseki
./fuseki-server --timeout=60000 --loc /var/wikidata/db/fuseki/ng/ \
      --set tdb:unionDefaultGraph=true /DS

Blazegraph

As with Fuseki, we set the heap to 20GB. For each $MODEL the Blazegraph server is started with:

export JAR_FILE=/var/wikidata/engines/blazegraph/bigdata-bundled.jar
export PROPERTY_FILE=/var/wikidata/dbconfig/blazegraph/$MODEL.properties
java -server -Xmx20g -Dbigdata.propertyFile=$PROPERTY_FILE -jar $JAR_FILE

In Blazegraph the default dataset is always assumed to be the union of all named graphs, so no specific configuration is needed to get this behaviour with the named graphs model.
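
The property file referenced above selects, among other things, the journal file and whether the store holds triples or quads. We do not reproduce the exact files used in the experiments; the following is only an illustrative sketch of a quad-store configuration for the ng model, with placeholder paths:

# /var/wikidata/dbconfig/blazegraph/ng.properties (illustrative sketch)
com.bigdata.journal.AbstractJournal.file=/var/wikidata/db/blazegraph/ng/bigdata.jnl
com.bigdata.rdf.store.AbstractTripleStore.quads=true
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false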

GraphDB

To give the heap 20GB we set the following line in the file startup.sh inside the GraphDB distribution directory:

JVM_OPTS="-XX:PermSize=256m -XX:MaxPermSize=256m -Xmx20G"
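
After editing startup.sh, the server is launched with that script from the GraphDB distribution directory. A sketch, with the installation path assumed by analogy with the other engines:

cd /var/wikidata/engines/graphdb
./startup.sh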

4Store

In 4store it is recommended to use twice the number of processor cores as segments. We set this parameter in the file metadata.nt. We assumed that we had 8 processors, which implies 16 segments. However, these were virtual cores due to the hyper-threading support of our server's CPU, so it should have been 8 segments instead of 16.
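
The segment count can also be fixed when the knowledge base is created, instead of editing metadata.nt by hand. A sketch using the standard 4store command-line tools, with a placeholder knowledge-base name:

4s-backend-setup --segments 16 wikidata_$MODEL   # create the KB with 16 segments
4s-backend wikidata_$MODEL                       # start the backend processes
4s-httpd -p 8000 wikidata_$MODEL                 # expose the SPARQL endpoint on port 8000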