This document aims to make replicable the experiments described in the paper Reifying RDF: What Works Well With Wikidata? (by Daniel Hernández, Aidan Hogan and Markus Krötzsch). The objective of the experiments was to empirically compare different models for representing qualified statements in Wikidata.
This document is a work in progress. If you have a question or suggestion about the experiments, please write to Daniel Hernández (email@example.com).
You can download the article, the results, the data in each schema and the queries. The original data was taken from the RDF dump of 2015-02-23.
The experiments were performed on a machine with an Intel Xeon CPU E5-2407 v2, 32GB of memory (4×8GB DDR3 SDRAM, Registered, 1333 MHz) and 1TB of disk space (a RAID of two Seagate ST1000NM0033-9ZM disks: SATA3, 128MB cache, 7,200 RPM).
By default, Virtuoso stores the data and the configuration in the directory /usr/local/virtuoso-opensource/var/lib/virtuoso/db. Instead, we store the database of each model $MODEL in the directory /var/wikidata/db/virtuoso/$MODEL/, where $MODEL identifies the representation model being evaluated (e.g. ng for named graphs). Thus, we create a symbolic link before starting the Virtuoso server for each model:
cd /usr/local/virtuoso-opensource/var/lib/virtuoso/
ln -s /var/wikidata/db/virtuoso/$MODEL/ db
Inside each database directory, the configuration is in the file
virtuoso.ini. In this configuration file we set the following properties:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
MaxQueryCostEstimationTime = 0
MaxQueryExecutionTime = 60
The values for the properties NumberOfBuffers and
MaxDirtyBuffers are those recommended for a machine with
32GB of memory (as our machine has).
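As a rough sanity check on those values (assuming Virtuoso's 8 KB database pages), the memory implied by the buffer pool can be computed as:

```shell
# 2720000 buffers of 8 KB each, expressed in MB:
echo "$(( 2720000 * 8 / 1024 )) MB"   # prints "21250 MB"
```

That is about 21GB, i.e. roughly two thirds of the 32GB of RAM, which is consistent with sizing the buffer pool to about two thirds of physical memory.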
MaxQueryCostEstimationTime indicates the maximum estimated
time for a query: if a query is estimated to take longer than this value,
it is not executed at all. When this property is set to 0, no limit
is placed on the estimated execution time. Finally,
MaxQueryExecutionTime indicates how much time a query has
to execute; when a query reaches this time it is aborted. Thus,
we set MaxQueryCostEstimationTime and
MaxQueryExecutionTime to 0 and 60 seconds, respectively, to ensure
a real timeout of one minute.
In Virtuoso the default dataset is always assumed to be the union of all named graphs, so no specific configuration is needed to obtain this behaviour for the named graphs model.
We set the heap size to 20GB in an environment variable to give Fuseki enough memory before starting the server.
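The original text does not name the variable; a minimal sketch, assuming the JVM_ARGS variable that the fuseki-server launcher script reads:

```shell
# Assumption: the fuseki-server launcher honours JVM_ARGS;
# -Xmx20g caps the Java heap at 20 GB.
export JVM_ARGS="-Xmx20g"
```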
Then we start the Fuseki server with the following command
inside the directory of the Fuseki distribution:
cd /var/wikidata/engines/fuseki
./fuseki-server --timeout=60000 --loc /var/wikidata/db/fuseki/$MODEL/ /DS
The option --timeout=60000 sets the execution timeout to one
minute (the value is given in milliseconds). Fuseki returns an error if a query has produced no results before the
timeout, or truncates the results if the timeout occurs while they are being sent.
The named graphs model (ng) requires setting the default graph as
the union of all the named graphs in the dataset. This is done by adding
--set tdb:unionDefaultGraph=true to the command
that starts the server:
cd /var/wikidata/engines/fuseki
./fuseki-server --timeout=60000 --loc /var/wikidata/db/fuseki/ng/ \
  --set tdb:unionDefaultGraph=true /DS
We start the Blazegraph server with the following commands:

export JAR_FILE=/var/wikidata/engines/blazegraph/bigdata-bundled.jar
export PROPERTY_FILE=/var/wikidata/dbconfig/blazegraph/$MODEL.properties
java -server -Xmx20g -Dbigdata.propertyFile=$PROPERTY_FILE -jar $JAR_FILE
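The per-model property files themselves are not reproduced here; as an illustration only, a minimal fragment of what the file for the named graphs model (ng) would need as a quad store. The journal path is hypothetical; the property names are Blazegraph's standard ones:

```properties
# Store quads (named graphs) rather than plain triples:
com.bigdata.rdf.store.AbstractTripleStore.quads=true
# Location of the journal file on disk (illustrative path):
com.bigdata.journal.AbstractJournal.file=/var/wikidata/db/blazegraph/ng.jnl
```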
In Blazegraph the default dataset is always assumed to be the union of all named graphs, so no specific configuration is needed to obtain this behaviour for the named graphs model.
To give the heap 20GB we set the following line in the file
startup.sh inside the GraphDB distribution directory:
JVM_OPTS="-XX:PermSize=256m -XX:MaxPermSize=256m -Xmx20G"
For 4store it is recommended to use twice as many segments as processor
cores. We set this parameter in the file
metadata.nt. We assumed that our machine has 8
processors, which implies 16 segments. However, these were virtual cores
due to the hyper-threading support of our server's CPU, so it should have been 8
segments instead of 16.
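To make the correction concrete, the arithmetic under the stated assumption of two hardware threads per core:

```shell
REPORTED=8                                  # processors reported by the OS (logical cores)
THREADS_PER_CORE=2                          # hyper-threading doubles the count
PHYSICAL=$(( REPORTED / THREADS_PER_CORE )) # 4 physical cores
echo $(( 2 * PHYSICAL ))                    # recommended segments: prints 8
```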