About
This documentation aims to help reproduce the experiments described in the paper Querying Wikidata: Comparing SPARQL, Relational and Graph Databases (by Daniel Hernández, Aidan Hogan, Cristian Riveros, Carlos Rojas and Enzo Zerega).
These pages were generated using Jekyll from the source available in the repository wquery-docs. This documentation is also accessible at the following URLs:
- http://users.dcc.uchile.cl/~dhernand/wquery/ (this website).
- https://dx.doi.org/10.6084/m9.figshare.3219217.v3 (DOI pointing to this website).
The documentation is laid out as follows:
- License: details of licensing for original and third-party material
- Dataset and code: links to download (mirrored) dataset and necessary code
- Queries: details of queries generated for the benchmark
- Experimental Setting: details of the hardware/software setting used in the paper
- Repeating the experiments: step-by-step guide for repeating the experiments presented
- Results: raw and supplementary results supporting the paper
License
All our code and documentation is published under the Creative Commons CC-BY License. The Wikidata dump is published under Creative Commons CC0 License.
All of the engines used in these experiments are distributed under open licenses: PostgreSQL uses the PostgreSQL License, Virtuoso Open Source uses the GPLv2 License, Blazegraph uses the GPLv2 License and Neo4j Community Edition uses the GPLv3 License.
Dataset and code
To reproduce these experiments, it is necessary to download and install the database engines and the following resources:
Code and results (https://bitbucket.org/danielhz/wikidata-experiments). The code required to repeat these experiments is published in this repository, along with the parameters used to generate the queries and the results of the experiments.
Dataset (https://dx.doi.org/10.6084/m9.figshare.3208498.v1). All experiments are done using the dump of Wikidata published on January 4th, 2016. The original dump was downloaded from the dumps folder published by the Wikimedia Foundation. However, the contents of this folder are frequently updated and old dumps are discarded; thus, in the interest of sustainability, we have published the dump we used on Figshare.
Queries
The experiments presented in the paper consider two groups of queries: atomic queries that are based on quins, and more complex queries that are structured as snowflakes of various depths.
Atomic quin queries
Quin queries have the following components:
- ?x0: Item (s).
- ?x1: Claim property (p).
- ?x2: Property value (o).
- ?x3: Qualifier property (q).
- ?x4: Qualifier value (v).
We create different atomic queries using patterns in which the various positions are either variable or constant. We use bitmasks to represent these patterns, where, for example, the bitmask 01110 indicates that p, o and q are constants while s and v are variables.
For each bitmask, a file with 300 queries is generated, where each query is a quin following that bitmask. For example, the following quin is in the CSV file query_parameters/quins/quins_01110.csv of the repository (which corresponds to the bitmask 01110).
?x0 P2239 Q21402571 P636 ?x4
Note: Each such query is generated from the data and thus will produce non-empty results.
Note: The variables ?x1 to ?x3 are replaced with constants because the bitmask contains a 1 in these positions.
Note: The property value and the qualifier value are value items (not datatype values).
Each quin is then converted into a “concrete query” for the particular setting; below we present examples for the previous 01110 quin.
Note: If the qualifier property or the qualifier value is not specified, then a left outer join is used. Thus, the operators OPTIONAL, OPTIONAL MATCH and LEFT OUTER JOIN are used in the SPARQL, Cypher and SQL implementations, respectively.
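As an illustration of this conversion (the actual generators are the Ruby scripts in the repository), the following Python sketch builds the n-ary relations query for a quin whose claim property and qualifier property are constants, as in the 01110 example; the helper name and structure are our own and not taken from the code:

# Illustrative sketch only; the repository's Ruby scripts are the real generators.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX p: <http://www.wikidata.org/prop/>\n"
    "PREFIX ps: <http://www.wikidata.org/prop/statement/>\n"
)

def naryrel_query(s, p, o, q, v, limit=10000):
    # Quin components that start with '?' are variables; the rest are constants.
    is_var = lambda t: t.startswith("?")
    wd = lambda t: t if is_var(t) else "wd:" + t   # item constants use the wd: prefix
    projection = " ".join(t for t in (s, o, v) if is_var(t))
    body = (f"{wd(s)} p:{p} ?c . "
            f"?c ps:{p} {wd(o)} ; "
            f"p:{q} {wd(v)} .")
    return f"{PREFIXES}SELECT {projection}\nWHERE {{ {body} }}\nLIMIT {limit}"

# Reproduces, up to variable names, the n-ary relations query shown below.
print(naryrel_query("?x0", "P2239", "Q21402571", "P636", "?x4"))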
SPARQL (n-ary relations)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
SELECT ?s ?qo
WHERE { ?s p:P2239 ?c . ?c ps:P2239 wd:Q21402571 ; p:P636 ?qo . }
LIMIT 10000
SPARQL (named graphs)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?s ?qo
WHERE { GRAPH ?c { ?s p:P2239 wd:Q21402571 . ?c p:P636 ?qo } .
FILTER (?s != ?c) }
LIMIT 10000
SPARQL (singleton properties)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?s ?qo
WHERE { ?s ?c wd:Q21402571 .
?c rdf:singletonPropertyOf p:P2239 ; p:P636 ?qo . }
LIMIT 10000
SPARQL (standard reification)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?s ?qo
WHERE { ?c rdf:subject ?s ;
rdf:predicate p:P2239 ;
rdf:object wd:Q21402571 ;
p:P636 ?qo . }
LIMIT 10000
Cypher
MATCH (s:Item)-[:PROP_FROM]->(c:Claim)-[:PROP_TO]->(o),
(c)-[:PROPERTY]->(p:Property),
(c)-[:QUAL_FROM]->(qn:Qualifier)-[:QUAL_TO]->(q),
(qn)-[:PROPERTY]->(qp:Property)
WHERE p.id='P2239' AND
o.id='Q21402571' AND
qp.id='P636'
RETURN s.id, q.id
LIMIT 10000;
SQL
SELECT
claims.entity_id,
qualifiers.datavalue_entity
FROM
claims,
qualifiers
WHERE
qualifiers.claim_id = claims.id AND
claims.property = 'P2239' AND
claims.datavalue_entity = 'Q21402571' AND
qualifiers.property = 'P636'
Snowflake (depth-*) queries
Snowflake queries are generated using a list of nodes (encoded in JSON). Each node represents an item, identified by the attribute entity_id. For each node, a set of claims is selected. Each claim is encoded as a pair [P,X], where P is the claim property and X is either an item value id, a 0 (the item value of the claim is not projected in the solution) or a 1 (the item value of the claim is projected in the solution). Also, the attribute property encodes the property that joins a node to the next node in the snowflake. For example, the following is a raw snowflake pattern:
[{"entity_id":"Q6260140",
"claims":[["P102",0],["P735","Q4925477"],["P31",0]],
"property":"P27"},
{"entity_id":"Q30"
"claims":[["P1792",1]],
"property":"P530"},
{"entity_id":"Q408",
"claims":[["P1343",1],["P1792",1]]}]
The snowflake parameters are stored in the folder query_parameters/paths of the code repository. To generate these parameters, we first generate tables containing tuples of items that are connected and that have an out-degree of at least 5 to other items. The raw tables are generated with the SQL scripts in the folder wikidata-experiments/sql/parameters_2. Then we run the following commands:
cd wikidata-experiments/sql/parameters_2
./generate_path.rb 1 500
./generate_path.rb 2 500
In the following, we give examples of the concrete queries generated from the raw snowflake presented above.
SPARQL (n-ary relations)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX wikibase: <http://wikiba.se/ontology-beta#>
SELECT ?x0 ?x1 ?x1y0 ?x2 ?x2y0 ?x2y1
WHERE {
?x0 p:P102 ?claim_x0y0 .
?claim_x0y0 ps:P102 ?x0y0 .
?x0 p:P735 ?claim_x0y1 .
?claim_x0y1 ps:P735 wd:Q4925477 .
?x0 p:P31 ?claim_x0y2 .
?claim_x0y2 ps:P31 ?x0y2 .
?x0 p:P27 ?claim_x0 .
?claim_x0 ps:P27 ?x1 .
?x1 p:P1792 ?claim_x1y0 .
?claim_x1y0 ps:P1792 ?x1y0 .
?x1 p:P530 ?claim_x1 .
?claim_x1 ps:P530 ?x2 .
?x2 p:P1343 ?claim_x2y0 .
?claim_x2y0 ps:P1343 ?x2y0 .
?x2 p:P1792 ?claim_x2y1 .
?claim_x2y1 ps:P1792 ?x2y1 . }
LIMIT 10000
SPARQL (named graphs)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?x0 ?x1 ?x1y0 ?x2 ?x2y0 ?x2y1
WHERE {
GRAPH ?claim_x0y0 { ?x0 p:P102 ?x0y0 } .
GRAPH ?claim_x0y1 { ?x0 p:P735 wd:Q4925477 } .
GRAPH ?claim_x0y2 { ?x0 p:P31 ?x0y2 } .
GRAPH ?claim_x0 { ?x0 p:P27 ?x1 } .
GRAPH ?claim_x1y0 { ?x1 p:P1792 ?x1y0 } .
GRAPH ?claim_x1 { ?x1 p:P530 ?x2 } .
GRAPH ?claim_x2y0 { ?x2 p:P1343 ?x2y0 } .
GRAPH ?claim_x2y1 { ?x2 p:P1792 ?x2y1 } . }
LIMIT 10000
SPARQL (singleton properties)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?x0 ?x1 ?x1y0 ?x2 ?x2y0 ?x2y1
WHERE {
?x0 ?claim_x0y0 ?x0y0 . ?claim_x0y0 rdf:singletonPropertyOf p:P102 .
?x0 ?claim_x0y1 wd:Q4925477 . ?claim_x0y1 rdf:singletonPropertyOf p:P735 .
?x0 ?claim_x0y2 ?x0y2 . ?claim_x0y2 rdf:singletonPropertyOf p:P31 .
?x0 ?claim_x0 ?x1 . ?claim_x0 rdf:singletonPropertyOf p:P27 .
?x1 ?claim_x1y0 ?x1y0 . ?claim_x1y0 rdf:singletonPropertyOf p:P1792 .
?x1 ?claim_x1 ?x2 . ?claim_x1 rdf:singletonPropertyOf p:P530 .
?x2 ?claim_x2y0 ?x2y0 . ?claim_x2y0 rdf:singletonPropertyOf p:P1343 .
?x2 ?claim_x2y1 ?x2y1 . ?claim_x2y1 rdf:singletonPropertyOf p:P1792 . }
LIMIT 10000
SPARQL (standard reification)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?x0 ?x1 ?x1y0 ?x2 ?x2y0 ?x2y1
WHERE {
?claim_x0y0 rdf:subject ?x0 .
?claim_x0y0 rdf:predicate p:P102 .
?claim_x0y0 rdf:object ?x0y0 .
?claim_x0y1 rdf:subject ?x0 .
?claim_x0y1 rdf:predicate p:P735 .
?claim_x0y1 rdf:object wd:Q4925477 .
?claim_x0y2 rdf:subject ?x0 .
?claim_x0y2 rdf:predicate p:P31 .
?claim_x0y2 rdf:object ?x0y2 .
?claim_x0 rdf:subject ?x0 .
?claim_x0 rdf:predicate p:P27 .
?claim_x0 rdf:object ?x1 .
?claim_x1y0 rdf:subject ?x1 .
?claim_x1y0 rdf:predicate p:P1792 .
?claim_x1y0 rdf:object ?x1y0 .
?claim_x1 rdf:subject ?x1 .
?claim_x1 rdf:predicate p:P530 .
?claim_x1 rdf:object ?x2 .
?claim_x2y0 rdf:subject ?x2 .
?claim_x2y0 rdf:predicate p:P1343 .
?claim_x2y0 rdf:object ?x2y0 .
?claim_x2y1 rdf:subject ?x2 .
?claim_x2y1 rdf:predicate p:P1792 .
?claim_x2y1 rdf:object ?x2y1 . }
LIMIT 10000
Cypher
MATCH (x0:Item)-[:PROP_FROM]->(c0:Claim)-[:PROP_TO]->(x1:Item),
(c0)-[:PROPERTY]->(:Property {id:"P27"}),
(x0)-[:PROP_FROM]->(cx0y0:Claim)-[:PROP_TO]->(x0y0:Item),
(cx0y0)-[:PROPERTY]->(:Property {id:"P102"}),
(x0)-[:PROP_FROM]->(cx0y1:Claim)-[:PROP_TO]->(x0y1:Item {id:"Q4925477"}),
(cx0y1)-[:PROPERTY]->(:Property {id:"P735"}),
(x0)-[:PROP_FROM]->(cx0y2:Claim)-[:PROP_TO]->(x0y2:Item),
(cx0y2)-[:PROPERTY]->(:Property {id:"P31"}),
(x1:Item)-[:PROP_FROM]->(c1:Claim)-[:PROP_TO]->(x2:Item),
(c1)-[:PROPERTY]->(:Property {id:"P530"}),
(x1)-[:PROP_FROM]->(cx1y0:Claim)-[:PROP_TO]->(x1y0:Item),
(cx1y0)-[:PROPERTY]->(:Property {id:"P1792"}),
(x2)-[:PROP_FROM]->(cx2y0:Claim)-[:PROP_TO]->(x2y0:Item),
(cx2y0)-[:PROPERTY]->(:Property {id:"P1343"}),
(x2)-[:PROP_FROM]->(cx2y1:Claim)-[:PROP_TO]->(x2y1:Item),
(cx2y1)-[:PROPERTY]->(:Property {id:"P1792"})
RETURN x0.id, x1.id, x1y0.id, x2.id, x2y0.id, x2y1.id
LIMIT 10000;
SQL
SELECT
c0.entity_id,
c1.entity_id,
cx1y0.datavalue_entity,
c1.datavalue_entity,
cx2y0.datavalue_entity,
cx2y1.datavalue_entity
FROM
claims AS c0,
claims AS c1,
claims AS cx0y0,
claims AS cx0y1,
claims AS cx0y2,
claims AS cx1y0,
claims AS cx2y0,
claims AS cx2y1
WHERE
c0.property = 'P27' AND
c0.entity_id = cx0y0.entity_id AND
cx0y0.property = 'P102' AND
c0.entity_id = cx0y1.entity_id AND
cx0y1.property = 'P735' AND
cx0y1.datavalue_entity = 'Q4925477' AND
c0.entity_id = cx0y2.entity_id AND
cx0y2.property = 'P31' AND
c0.datavalue_entity = c1.entity_id AND
c1.property = 'P530' AND
c1.entity_id = cx1y0.entity_id AND
cx1y0.property = 'P1792' AND
c1.datavalue_entity = cx2y0.entity_id AND
cx2y0.property = 'P1343' AND
c1.datavalue_entity = cx2y1.entity_id AND
cx2y1.property = 'P1792'
Experimental settings
Here we describe the specific setting in which we ran experiments for the paper.
Machine: All experiments were run on a single machine with 2× Intel Xeon Six Core E5-2609 V3 CPUs, 32GB of RAM, and 2× 1TB Seagate 7200 RPM 32MB Cache SATA hard-disks in a RAID-1 configuration.
System: All experiments were run on a Debian 7 system using Ruby 2.3, Python 2 and Bash scripts. The partition mounted at / (which has only 55GB) is not large enough for the data used; thus, where necessary, we use the partition mounted at /home (which has 1.7TB) for the data.
Timeouts: We assumed a server-side timeout of 60 seconds per query.
Repeating the experiments
The list below enumerates the tasks required to perform the experiments. The figure below shows which tasks can be done in parallel. Tasks enclosed in boxes should be done on a dedicated machine (without other processes running in the background). We recommend executing these tasks with GNU Screen.
Note: We assume that distributions compatible with Ruby 2.3, Python 2 and Java 7.0 have been installed.
Note: In the documentation of these tasks, shell variables (e.g. $USER, $WD_HOME) will be used. You will have to set these shell variables according to your own settings.
Note: All experiments will be run by a user identified as $USER whose home is $USER_HOME. Also, $USER should be in the sudo group.
Diagram of tasks
List of tasks
- Download the code.
- Download the dataset.
- Generate the raw queries.
- Translate the raw data to RDF.
- Configure Virtuoso.
- Load the data into Virtuoso.
- Configure Blazegraph.
- Load the data into Blazegraph.
- Generate SPARQL queries and run experiments in Virtuoso and Blazegraph.
- Translate the raw data to the relational model.
- Configure PostgreSQL.
- Load the data into PostgreSQL.
- Generate SQL queries and run experiments in PostgreSQL.
- Translate the data to the Neo4j data model.
- Configure Neo4j.
- Load the data in Neo4j.
- Run experiments in Neo4j.
1. Download the code
The code is tracked in a git repository. You can get the code with the following command:
git clone https://bitbucket.org/danielhz/wikidata-experiments.git
In what follows, we refer to the folder created by git as $WD_HOME.
2. Download the dataset
You can download the dataset with the following command:
wget https://dx.doi.org/10.6084/m9.figshare.3208498.v1
In what follows, we refer to the file created by wget as $DATASET.
3. Generate the raw queries
The next step is to generate raw queries from the data. These queries will later be converted to concrete queries in SPARQL, SQL and Cypher. Please see the Queries section for more details.
To generate the atomic quin queries, execute:
cd $WD_HOME
bin/generate_quins.rb
To generate the snowflake queries at depth 1 and 2, execute:
cd $WD_HOME/sql/parameters_2
./generate_path.rb 1 500
./generate_path.rb 2 500
4. Translate the raw data to RDF
Before translating the data to RDF, we need to split it into several smaller files (the translation process leaks memory, so splitting the input ensures that each process finishes without running out of memory). Each small file is translated by an independent process.
The commands below prepare the dataset. We assume that $DATASET is the path to the dataset, $RDF is the folder where we store the RDF versions of the dataset, and $WD_HOME is the folder containing the code.
cd $RDF
bunzip2 -c $DATASET | split -d -a 3 -C 100000000
rename 's/$/.json/' x*
gzip x*.json
cd $WD_HOME
translation/translate_all.rb $RDF
After running the above commands, the $RDF folder will contain several JSON files (e.g., x000.json.gz, x001.json.gz, where x001 indicates a file-part). For each of these JSON files, four N-Quads files are created, one for each schema. They are named using the keywords naryrel (n-ary relations), ngraphs (named graphs), sgprop (singleton properties) and stdreif (standard reification). For example, the file-part x000.json.gz is translated into the files x000-naryrel.nq.gz, x000-ngraphs.nq.gz, x000-sgprop.nq.gz and x000-stdreif.nq.gz.
Note: We use the environment variable $SCHEMA for the current schema in RDF experiments, which can be one of naryrel, ngraphs, sgprop or stdreif.
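As a rough illustration of what these four serializations look like (the real mapping is implemented in translation/translate_all.rb; the identifiers below are made up and the prefixes abbreviated), the following sketch renders a single qualified statement under each schema:

# Illustrative sketch only: how one qualified statement (s, p, o) with a
# qualifier (q, v) and a claim identifier c could be serialized under the
# four schemas used in these experiments.  The term names are invented;
# translation/translate_all.rb implements the real mapping.
def reify(s, p, o, q, v, c):
    return {
        # n-ary relations: s --p:P--> claim node --ps:P--> o; qualifiers hang off the claim node.
        "naryrel": [f"{s} p:{p} {c} .",
                    f"{c} ps:{p} {o} .",
                    f"{c} p:{q} {v} ."],
        # named graphs: the claim triple and its qualifiers are placed in a graph named by the claim id.
        "ngraphs": [f"{s} p:{p} {o} {c} .",
                    f"{c} p:{q} {v} {c} ."],
        # singleton properties: a fresh predicate per claim, linked to the real property.
        "sgprop":  [f"{s} {c} {o} .",
                    f"{c} rdf:singletonPropertyOf p:{p} .",
                    f"{c} p:{q} {v} ."],
        # standard reification: the claim node describes the triple via rdf:subject/predicate/object.
        "stdreif": [f"{c} rdf:subject {s} .",
                    f"{c} rdf:predicate p:{p} .",
                    f"{c} rdf:object {o} .",
                    f"{c} p:{q} {v} ."],
    }

for schema, statements in reify("wd:Q1", "P2", "wd:Q3", "P4", "wd:Q5", "wd:C1").items():
    print(f"# {schema}")
    print("\n".join(statements))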
5. Configure Virtuoso
We used Virtuoso Open Source Edition (7.2.3-dev.3215-pthreads), compiled from the source.
For each $SCHEMA there is a directory $WD_HOME/dbfiles/virtuoso/db-$SCHEMA-1. Initially, this folder contains only the file virtuoso.ini.
By default, Virtuoso stores the data in a folder in the / partition (that does not have enough space in our setting). Inside this folder we create a symbolic link to the database folder of each $SCHEMA:
$ cd /usr/local/virtuoso-opensource/var/lib/virtuoso/
$ ln -s $WD_HOME/dbfiles/virtuoso/db-$SCHEMA-1 db-$SCHEMA-1
The following variables are set in the virtuoso.ini file of each $SCHEMA:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
MaxQueryCostEstimationTime = 0
MaxQueryExecutionTime = 60
The values for the properties NumberOfBuffers and MaxDirtyBuffers are recommended in the configuration file for a machine with 32GB of memory (as our machine has).
The property MaxQueryCostEstimationTime indicates the maximum estimated time for a query: if the engine estimates that a query will take longer than this value, then it will not try to evaluate it. We set this property to 0, which means that no estimation limits are applied; i.e., Virtuoso will attempt to evaluate all queries.
Finally, MaxQueryExecutionTime is the timeout for query execution: queries that exceed this timeout are aborted at runtime.
Note: In Virtuoso the default dataset is always assumed to be the union of all named graphs. Thus, no specific configuration is needed to get this behavior with the named graphs schema.
6. Load the data into Virtuoso
To load the data, it is assumed that Virtuoso is installed in /usr/local/virtuoso-opensource/var/lib/virtuoso/ ($VIRTUOSO_HOME in what follows). We also assume that the file $VIRTUOSO_CONFIG contains the configuration details for Virtuoso and that $DB_NARYREL, $DB_NGRAPHS, $DB_SGPROP and $DB_STDREIF are empty directories.
We load the data into Virtuoso with the following commands:
cd $DB_NARYREL
ln -s $VIRTUOSO_CONFIG virtuoso.ini
mkdir wikidata
cd $DB_NGRAPHS
ln -s $VIRTUOSO_CONFIG virtuoso.ini
mkdir wikidata
cd $DB_SGPROP
ln -s $VIRTUOSO_CONFIG virtuoso.ini
mkdir wikidata
cd $DB_STDREIF
ln -s $VIRTUOSO_CONFIG virtuoso.ini
mkdir wikidata
cd $WD_HOME/loading/virtuoso/
./load_data.rb
7. Configure Blazegraph
We used Blazegraph 2.1.0 Community Edition with Java 7, using the Java implementation distributed by Oracle (Java(TM) SE Runtime Environment, build 1.7.0_80-b15).
Some parameters are added to the command line to improve the resource usage of the process: we set the JVM heap to 6GB and use the G1 garbage collector of the HotSpot JVM (the JVM provides several garbage collectors); the documentation recommends these parameters.
We use the exec primitive in a Ruby script to start Blazegraph with the following parameters:
exec(['java', 'blazegraph'],
'-Xmx6g',
'-XX:+UseG1GC',
'-Djetty.overrideWebXml=override.xml',
'-Dbigdata.propertyFile=server.properties',
'-jar',
'blazegraph.jar')
The override.xml file defines a timeout of 60 seconds for query execution, and the server.properties file defines the parameters of the execution and the properties of the storage. Both configuration files can be found in the code repository of our experiments.
Note: In Blazegraph the default dataset is always assumed to be the union of all named graphs. Thus, no specific configuration is needed to get this behavior with the named graphs schema.
8. Load the data into Blazegraph
Blazegraph provides several data storage options. We use triple stores for experiments with n-ary relations, singleton properties and standard reification. For named graphs, we use a quad store.
The configuration for each backend is in the files $WD_HOME/dbfiles/blazegraph/triples.properties and $WD_HOME/dbfiles/blazegraph/quads.properties, respectively.
9. Generate SPARQL queries and run experiments in Virtuoso and Blazegraph
The SPARQL benchmarks use a configuration file containing the parameters that are necessary to run each experiment. The procedures to run experiments in Virtuoso and Blazegraph are the same. Thus, the following commands show how experiments are run in both engines.
$ cd $WD_HOME
$ bin/run_quins_benchmark config/virtuoso.rb
$ bin/run_quins_benchmark config/blazegraph.rb
$ bin/run_paths_benchmark config/paths_virtuoso.rb
$ bin/run_paths_benchmark config/paths_blazegraph.rb
Results for the quin benchmark are published in the folder results/quins of the repository. Similarly, the results for the experiments with snowflake structure are published in the folder results/paths.
After running these scripts, a CSV file is created for each set of queries. Each file name contains a reference to the engine, the representation and the query set. For example, the file results_blazegraph_onaryrel_01110.csv corresponds to the results obtained for atomic quin queries with bitmask 01110, using the Blazegraph engine and the n-ary relations representation.
The following is an example line from the results:
naryrel,path_1,0,11.50489629,125,200
We see, respectively: the RDF schema, the type of query run, the ID of the specific query, the time taken, the number of results returned (if no timeout occurred), and a 200 code indicating success (or timeout if a timeout is reached).
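As a quick sanity check, the following Python sketch (not part of the repository) summarizes one of these result files, assuming the six-column layout just described; the file name is a placeholder:

import csv
from statistics import mean

# Sketch: summarize one results file, assuming the six-column layout
# described above (schema, query type, query id, time, #results, status).
def summarize(path):
    times, timeouts, other = [], 0, 0
    with open(path) as f:
        for schema, qtype, qid, t, nresults, status in csv.reader(f):
            if status == "200":
                times.append(float(t))
            elif status == "timeout":
                timeouts += 1
            else:
                other += 1
    if times:
        print(f"{path}: {len(times)} ok (mean time {mean(times):.3f}),"
              f" {timeouts} timeouts, {other} other")
    else:
        print(f"{path}: no successful queries ({timeouts} timeouts, {other} other)")

summarize("results/paths/results_virtuoso_naryrel_path_1.csv")  # placeholder file name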
10. Translate the data to the relational model
The following commands generate six CSV files from the Wikidata dump in the folder csv (relative to the script). We assume that $DATASET is the file containing the original dataset.
cd $WD_HOME/postgresql-experiment-scripts/loading-data/
bunzip2 -c $DATASET > dump.json
ruby migrador.rb
11. Configure PostgreSQL
We used PostgreSQL 9.1.20 with the following variables set in the postgresql.conf file:
default_statistics_target = 100
maintenance_work_mem = 1920MB
shared_buffers = 7680MB
wal_buffers = 16MB
effective_cache_size = 22GB
work_mem = 160MB
default_transaction_isolation = 'read uncommitted'
statement_timeout = 60010
We used pgtune, a script that automatically generates a configuration for PostgreSQL on our server. This script is based on the recommendations in the PostgreSQL Wiki and sets the values of the properties maintenance_work_mem, shared_buffers and effective_cache_size. We also set the lowest level of isolation, whereby transactions are isolated just enough to ensure that physically corrupt data are not read.
We set a B-tree index for primary keys. By default PostgreSQL does not set any index for foreign keys because these are used for consistency and not for performance. We created a secondary index for each foreign key in the model and for each attribute that stores either entities, properties or data values (e.g. dates) from Wikidata.
12. Load the data in PostgreSQL
Before loading these CSV files into PostgreSQL, it is necessary to create the corresponding tables with the following schema:
Table "public.entities"
Column | Type | Modifiers
--------+------+-----------
id | text | not null
type | text |
value | text |
Indexes:
"entities_id" PRIMARY KEY, btree (id)
Table "public.labels"
Column | Type | Modifiers
----------+------+-----------
id | text | not null
language | text |
value | text |
Indexes:
"labels_id" btree (id)
"entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
"labels_language" btree (language)
Table "public.descriptions"
Column | Type | Modifiers
----------+------+-----------
id | text | not null
language | text |
value | text |
Indexes:
"descriptions_id" btree (id)
"entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
"descriptions_language" btree (language)
Table "public.aliases"
Column | Type | Modifiers
----------+------+-----------
id | text |
language | text |
value | text |
Indexes:
"aliases_id" btree (id)
"entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
"aliases_language_idx" btree (language)
Table "public.claims"
Column | Type | Modifiers
------------------+------+-----------
entity_id | text |
id | text | not null
type | text |
rank | text |
snaktype | text |
property | text |
datavalue_string | text |
datavalue_entity | text |
datavalue_date | text |
datavalue_type | text |
datatype | text |
Indexes:
"claims_id" PRIMARY KEY, btree (id)
"claims_datavalue_entity" btree (datavalue_entity)
"claims_entity_id" btree (entity_id)
"entities_fk" FOREIGN KEY (entity_id) REFERENCES entities(id)
"claims_property" btree (property)
"properties_fk" FOREIGN KEY (property) REFERENCES entities(id)
Table "public.qualifiers"
Column | Type | Modifiers
--------------------+------+-----------
claim_id | text | not null
property | text |
hash | text |
snaktype | text |
qualifier_property | text |
datavalue_string | text |
datavalue_entity | text |
datavalue_date | text |
datavalue_type | text |
datatype | text |
Indexes:
"qualifiers_claim_id" btree (claim_id)
"claims_fk" FOREIGN KEY (claim_id) REFERENCES claims(id)
"properties_fk" FOREIGN KEY (property) REFERENCES entities(id)
"qualifiers_datavalue_entity" btree (datavalue_entity)
"qualifiers_property" btree (property)
Then, the CSV files can be loaded with the following command:
cd $WD_HOME/postgresql-experiment-scripts/loading-data/
ruby script_commands.rb
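The Ruby script above performs the actual load; purely as an illustration of what it does for a single table, the following Python sketch bulk-loads one CSV file into the claims table using COPY (the connection parameters and file name are assumptions):

import psycopg2

# Illustration only: bulk-load one of the generated CSV files into the
# claims table with COPY.  The database name, user and CSV file name are
# assumptions; script_commands.rb performs the real load for all tables.
conn = psycopg2.connect(dbname="wikidata", user="postgres")
with conn, conn.cursor() as cur, open("csv/claims.csv") as f:
    cur.copy_expert("COPY claims FROM STDIN WITH (FORMAT csv)", f)
conn.close()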
13. Generate SQL queries and run experiments in PostgreSQL
The quins file must be in the same folder as the script that executes the quins benchmark. For example, the experiment for the bitmask 10000 is executed with the following commands:
cd $WD_HOME/postgresql-experiment-scripts/run-queries
ln -s $WD_HOME/query_parameters/quins/quins_10000 quins.csv
ruby script_f.rb
rm quins.csv
Similarly, snowflake benchmarks are executed with the commands:
cd $WD_HOME/postgresql-experiment-scripts/run-queries
ln -s $WD_HOME/query_parameters/paths/path_1.json path_1.json
ruby path1_production.rb
rm path_1.json
ln -s $WD_HOME/query_parameters/paths/path_2.json path_2.json
ruby path2_production.rb
rm path_2.json
14. Translate the data to the Neo4j data model
These scripts assume the existence of the following environment variables:
- $WD_HOME (where the repository was cloned).
- $DATASET (the path of the compressed JSON file).
- $DATASET_UNCOMPRESSED (the path of the uncompressed dataset).
- $LANG_LABELS (the folder to store labels).
- $LANG_DESCRIPTIONS (the folder to store descriptions).
- $LANG_ALIASES (the folder to store aliases).
- $MODEL_LIGHT (the folder to store the light model).
- $MODEL_COMPLETE (the folder to store the complete model).
Then the following commands create the language files and translate the data to CSV files to be imported into Neo4j.
bunzip2 -c $DATASET > $DATASET_UNCOMPRESSED
cd $WD_HOME/neo4j-experiment-scripts/generate_csv
python lang-generator.py # Generate language files.
python parser.py # Generate the complete model.
python parser-light.py # Generate the model without language labels.
15. Configure Neo4j
We used Neo4j Community Edition 2.3.1 with Java 7, using the Java distribution available on the system (OpenJDK Runtime Environment, IcedTea 2.6.6).
Two databases were tested: one with all the data of the JSON dump and the other without the Labels, Aliases and Descriptions of Wikidata entities, in order to make the nodes lighter.
We created indexes on the labels :Item(id), :Property(id) and :Entity(id) so as to map from entity ids (e.g., Q42) and property ids (e.g., P1432) to their respective nodes.
The following variables are set in conf/neo4j-wrapper.conf to give the JVM a 20GB heap:
wrapper.java.initmemory = 20480
wrapper.java.maxmemory = 20480
We also set the open file descriptor limit to 40,000, as recommended in the Neo4j documentation.
16. Load the data in Neo4j
At the end of the parsing process, several CSV files are generated in the corresponding folders ($MODEL_LIGHT and $MODEL_COMPLETE). You can load these files into Neo4j using the neo4j-import command. In the following example we assume that $CSV_PATH is one of $MODEL_LIGHT and $MODEL_COMPLETE. We also assume that $DB_PATH is the folder where the Neo4j database will be generated (which also contains the Neo4j application). You must use a different destination for each model.
$DB_PATH/bin/neo4j-import \
--into $DB_PATH/data/graph.db \
--nodes $CSV_PATH/entity.csv \
--nodes:String:Value $CSV_PATH/string.csv \
--nodes:Time:Value $CSV_PATH/time.csv \
--nodes:Quantity:Value $CSV_PATH/quantity.csv \
--nodes:Qualifier $CSV_PATH/qualifiers.csv \
--nodes:Reference $CSV_PATH/references.csv \
--nodes:Claim $CSV_PATH/claims.csv \
--nodes:Url:Value $CSV_PATH/url.csv \
--nodes:MonolingualText:Value $CSV_PATH/monolingual.csv \
--nodes:GlobeCoordinate:Value $CSV_PATH/globe.csv \
--nodes:CommonsMedia:Value $CSV_PATH/commons.csv \
--relationships $CSV_PATH/relationships.csv \
--bad-tolerance 999999999
After loading the data, indexes are created using the Neo4j console.
CREATE INDEX ON :Entity(id);
CREATE INDEX ON :Item(id);
CREATE INDEX ON :Property(id);
17. Run experiments in Neo4j
The Neo4j benchmarks are executed with the following commands:
$ cd $WD_HOME/neo4j-experiment-scripts/quins
$ ./execute_quin_queries.sh
$ cd $WD_HOME/neo4j-experiment-scripts/paths
$ ./path1.sh
$ ./path2.sh
These commands assume that the database files are in the folders $DB_1 and $DB_2. Both folders have exactly the same data; we alternate the databases to avoid caching effects produced by the caching policies of the operating system.
Results
Raw results of query runtimes are published in the folder results of the code repository.
In what follows, we present the elapsed times and the sizes of the databases after loading the data.
Loading times and database sizes
RDF data
The following table presents the number of statements loaded for each schema.
Model | Statements |
---|---|
naryrel | 563678588 |
ngraphs | 482371357 |
sgprop | 563676547 |
stdreif | 644981737 |
The following table presents the loading times of Virtuoso for each schema.
| | naryrel | ngraphs | sgprop | stdreif |
---|---|---|---|---|
Loading files (s) | 8265 | 10844 | 8344 | 8818 |
Indexing (s) | 5701 | 4023 | 5514 | 5118 |
Total (s) | 13966 | 14867 | 13858 | 13936 |
Times and data sizes used by Virtuoso are summarized in the following table.
Model | Elapsed time | | Size | |
---|---|---|---|---|
naryrel | 13966 s | 3.9 h | 49169825792 bytes | 46G |
ngraphs | 14867 s | 4.1 h | 50413436928 bytes | 47G |
sgprop | 13858 s | 3.9 h | 49715085312 bytes | 47G |
stdreif | 13936 s | 3.9 h | 49027219456 bytes | 46G |
The following table presents the loading times and space used by Blazegraph for each schema.
Model | Elapsed time | | Size | |
---|---|---|---|---|
naryrel | 105693.640 s | 23.4 h | 65205960704 bytes | 61G |
ngraphs | 240486.333 s | 66.8 h | 127813812224 bytes | 120G |
sgprop | 93865.713 s | 26.1 h | 65205960704 bytes | 61G |
stdreif | 107242.764 s | 29.8 h | 65205960704 bytes | 61G |
Note that in the case of the named graphs schema we use a quad store backend, while for the other schemas we use a triple store backend.
Neo4j
The size of the database with labels, before creating the indexes, is 47 GB. It contains:
- 214348455 nodes.
- 435080654 relationships.
- 661150129 properties.
The size of the database without labels, before creating the indexes, is 32 GB. It contains:
- 214348455 nodes.
- 435080654 relationships.
- 409258592 properties.
The times used to create the indexes are presented in the table below.
Index | Time |
---|---|
Entities | 5min 20s |
Item | 5min 10s |
Property | 1min 20s |
PostgreSQL
Task | Elapsed time |
---|---|
Loading data | 5184.8438s |
Index generation | 2065.7545s |
The following table presents the space used by each table in PostgreSQL.
table_name | table_size | indexes_size | total_size |
---|---|---|---|
"public"."claims" | 13 GB | 11 GB | 24 GB |
"public"."descriptions" | 10 GB | 7616 MB | 18 GB |
"public"."labels" | 5882 MB | 4873 MB | 11 GB |
"public"."entities" | 818 MB | 574 MB | 1392 MB |
"public"."aliases" | 612 MB | 474 MB | 1087 MB |
"public"."qualifiers" | 480 MB | 235 MB | 715 MB |
"pg_catalog"."pg_depend" | 384 kB | 424 kB | 808 kB |
"pg_catalog"."pg_proc" | 496 kB | 256 kB | 752 kB |
"pg_catalog"."pg_attribute" | 408 kB | 224 kB | 632 kB |
"pg_catalog"."pg_statistic" | 464 kB | 40 kB | 504 kB |