This document aims to make replicable the experiments described in the paper Reifying RDF: What Works Well With Wikidata? (by Daniel Hernández, Aidan Hogan and Markus Krötzsch). The objective of the experiments was to empirically compare the effect of different models for representing qualified statements in Wikidata.
This document is a work in progress. If you have a question or suggestion about the experiments, please write to Daniel Hernández (daniel@dcc.uchile.cl).
You can download the article, the results, the data in each schema and the queries. The original data was taken from the RDF dump of 2015-02-23.
The experiments were performed on a machine with an Intel Xeon CPU E5-2407 v2, 32GB of memory (4×8GB DDR3 SDRAM Registered 1333 MHz) and 1TB of disk space (a RAID with 2 Seagate ST1000NM0033-9ZM disks, SATA3, 128MB of cache and 7,200 RPM).
Virtuoso stores the data and the configuration in the directory:
/usr/local/virtuoso-opensource/var/lib/virtuoso/db/
Instead, we store the database of each model $MODEL in the directory:
/var/wikidata/db/virtuoso/$MODEL/
where $MODEL is one of sr, np, sp, or ng.
Thus, we create a symbolic link before starting the Virtuoso server for each model:
cd /usr/local/virtuoso-opensource/var/lib/virtuoso/
ln -s /var/wikidata/db/virtuoso/$MODEL/ db
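When switching between models, the link must be re-pointed; a minimal sketch (assuming the Virtuoso server has been stopped first) is:
# Replace the current link in place; -n avoids descending into the old target.
MODEL=np   # one of sr, np, sp, ng
cd /usr/local/virtuoso-opensource/var/lib/virtuoso/
ln -sfn /var/wikidata/db/virtuoso/$MODEL/ db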
Inside each database directory, the configuration is in the file virtuoso.ini. In this configuration file we set the following variables:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
MaxQueryCostEstimationTime = 0
MaxQueryExecutionTime = 60
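In the stock virtuoso.ini, NumberOfBuffers and MaxDirtyBuffers appear in the [Parameters] section and the two query limits in the [SPARQL] section; as a sketch, the values could be applied with sed before starting each server (the path follows the layout above):
INI=/var/wikidata/db/virtuoso/$MODEL/virtuoso.ini
sed -i \
    -e 's/^NumberOfBuffers.*/NumberOfBuffers = 2720000/' \
    -e 's/^MaxDirtyBuffers.*/MaxDirtyBuffers = 2000000/' \
    -e 's/^MaxQueryCostEstimationTime.*/MaxQueryCostEstimationTime = 0/' \
    -e 's/^MaxQueryExecutionTime.*/MaxQueryExecutionTime = 60/' \
    "$INI"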
The values for the properties NumberOfBuffers and MaxDirtyBuffers are those recommended for a machine with 32GB of memory (as ours has).
The property MaxQueryCostEstimationTime indicates the maximum estimated time for a query. If a query is estimated to take longer than this value, it is not executed at all. When this property is set to 0, no limit is placed on the estimated execution time. Finally, MaxQueryExecutionTime indicates how much time a query is allowed to run; when a query execution reaches this limit, it is aborted. Thus, we set MaxQueryCostEstimationTime to 0 and MaxQueryExecutionTime to 60 seconds to ensure a real timeout of one minute.
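A quick way to verify the timeout is to time an expensive query against the SPARQL endpoint (the port 8890 and the /sparql path assume a default Virtuoso installation):
# This unselective join should be aborted at around 60 seconds.
time curl -sG 'http://localhost:8890/sparql' \
    --data-urlencode 'query=SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o . ?x ?y ?o }'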
In Virtuoso the default dataset is always assumed to be the union of all named graphs. Thus, no specific configuration is needed to get this behaviour with the named graphs model.
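For example, a pattern without a GRAPH clause, as in the following sketch, matches triples stored in any named graph (endpoint as above):
curl -sG 'http://localhost:8890/sparql' \
    --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10'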
We choose to set the heap size to 20GB in an environment variable to give enough memory to Fuseki before starting the server:
export JVM_ARGS=-Xmx20G
For each $MODEL in sr, np and sp, we start the Fuseki server with the following command inside the directory of the Fuseki distribution:
cd /var/wikidata/engines/fuseki
./fuseki-server --timeout=60000 --loc /var/wikidata/db/fuseki/$MODEL/ /DS
The parameter --timeout=60000 sets the execution timeout to one minute. Fuseki returns an error if a query has sent no results before the timeout, or truncates the results if the timeout occurs while they are being sent.
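Queries can then be sent to the query endpoint of the dataset; for example (the port 3030 is the Fuseki default, and the /DS/query path follows from the dataset name above):
curl -sG 'http://localhost:3030/DS/query' \
    --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10'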
The named graphs model (ng) requires setting the default graph as the union of all the named graphs in the dataset. This is done by adding the parameter --set tdb:unionDefaultGraph=true to the command that starts the server:
cd /var/wikidata/engines/fuseki
./fuseki-server --timeout=60000 --loc /var/wikidata/db/fuseki/ng/ \
    --set tdb:unionDefaultGraph=true /DS
For each $MODEL, we start the Blazegraph server giving 20GB to the heap and passing the property file of the model:
export JAR_FILE=/var/wikidata/engines/blazegraph/bigdata-bundled.jar
export PROPERTY_FILE=/var/wikidata/dbconfig/blazegraph/$MODEL.properties
java -server -Xmx20g -Dbigdata.propertyFile=$PROPERTY_FILE -jar $JAR_FILE
In Blazegraph the default dataset is always assumed to be the union of all named graphs. Thus, no specific configuration is needed to get this behaviour with the named graphs model.
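The property files themselves are specific to each model; a minimal sketch of what the ng file might contain, using standard Blazegraph property names (the journal path is illustrative), is:
# Hypothetical excerpt of /var/wikidata/dbconfig/blazegraph/ng.properties
com.bigdata.journal.AbstractJournal.file=/var/wikidata/db/blazegraph/ng.jnl
# quads mode stores named graphs; the triple-based models would use quads=false
com.bigdata.rdf.store.AbstractTripleStore.quads=true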
To give 20GB to the heap, we set the following line in the file startup.sh inside the GraphDB distribution directory:
JVM_OPTS="-XX:PermSize=256m -XX:MaxPermSize=256m -Xmx20G"
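The server is then started with that script; assuming the distribution lives alongside the other engines (the path below is our assumption):
cd /var/wikidata/engines/graphdb
./startup.sh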
In 4store it is recommended to use twice the number of processor cores as the number of segments. We set this parameter in the file metadata.nt. We assumed that we had 8 processor cores, which implies having 16 segments. However, these were virtual cores due to the hyper-threading support of our server's CPU, so it should have been 8 segments instead of 16.
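For reference, the number of segments is normally fixed when a 4store knowledge base is created, for example (the kb name wikidata is illustrative):
4s-backend-setup --segments 16 wikidata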