DEPRECATED. See http://d2s.semanticscience.org
This is a demonstrator ETL pipeline that converts relational databases, tabular files, and XML files to RDF. It first generates a generic RDF representation based on the input data structure; the user then writes SPARQL queries that map this generic RDF to a specific data model.
Each step runs as a Docker container configured with a few parameters (e.g. input file path, SPARQL endpoint, credentials, mapping file path).
git clone --recursive https://github.com/MaastrichtU-IDS/data2services-pipeline.git
cd data2services-pipeline
# Update all submodules
git submodule update --recursive --remote
build.sh is a convenience script to build/pull all Docker images, but they can be built separately.
Put the GraphDB .zip files in the ./graphdb directory.
Build/pull all Docker images:
# Don't forget to put GraphDB zip file in the graphdb folder
./build.sh
In a production environment, both the Apache Drill and GraphDB services are assumed to be already running. Other RDF triple stores should also work, but have not been tested yet.
# Pull and start apache-drill
docker pull vemonet/apache-drill
docker run -dit --rm -p 8047:8047 -p 31010:31010 \
--name drill -v /data:/data:ro \
vemonet/apache-drill
# Build and start graphdb (don't forget to put the .zip file in the graphdb folder)
docker build -t graphdb ./graphdb
docker run -d --rm --name graphdb -p 7200:7200 \
-v /data/graphdb:/opt/graphdb/home \
-v /data/graphdb-import:/root/graphdb-import \
graphdb
Make sure that access to the /data directory has been granted in the Docker configuration.
docker-compose
Files to convert need to be located in /data (to comply with the Apache Drill shared volume).
The examples below use /data/data2services as the working directory (it contains all the files; note that it is usually shared as /data in the Docker containers).
Source files can be set to be downloaded automatically using shell scripts. See the data2services-download module for more details.
docker pull vemonet/data2services-download
docker run -it --rm -v /data/data2services:/data \
vemonet/data2services-download \
--download-datasets drugbank,hgnc,date \
--username my_login --password my_password \
--clean # to delete all files in /data/data2services
Use xml2rdf to convert XML files to a generic RDF based on the file structure.
docker pull vemonet/xml2rdf
docker run --rm -it -v /data:/data \
vemonet/xml2rdf \
-i "/data/data2services/myfile.xml.gz" \
-o "/data/data2services/myfile.nq.gz" \
-g "https://w3id.org/data2services/graph/xml2rdf"
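When the working directory holds several XML files, the xml2rdf command above can be wrapped in a loop. The following is a minimal sketch, not part of the project: the function name xml2rdf_all and the DOCKER override variable are hypothetical, and it assumes the directory lives under /data so that host and container paths match.

```shell
#!/bin/sh
# Hypothetical helper: run xml2rdf once per *.xml.gz file in a directory.
# DOCKER is an assumed override; set DOCKER=echo to preview the commands.
DOCKER="${DOCKER:-docker}"

xml2rdf_all() (
  # The directory must be under /data, which is mounted into the container.
  dir="${1:-/data/data2services}"
  for input in "$dir"/*.xml.gz; do
    [ -e "$input" ] || continue        # the glob matched no file
    output="${input%.xml.gz}.nq.gz"    # myfile.xml.gz -> myfile.nq.gz
    # -it is left out here because the loop needs no interactive terminal.
    "$DOCKER" run --rm -v /data:/data \
      vemonet/xml2rdf \
      -i "$input" \
      -o "$output" \
      -g "https://w3id.org/data2services/graph/xml2rdf"
  done
)
```

Calling xml2rdf_all /data/data2services converts every matching file into the same graph; running it with DOCKER=echo first prints the commands without starting any container.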
We use AutoR2RML to generate the R2RML mapping file that converts relational databases (Postgres, SQLite, MariaDB) and CSV, TSV and PSV files to a generic RDF representing the input data structure. See the Wiki for other database management systems and how to deploy databases.
docker pull vemonet/autor2rml
# For CSV, TSV, PSV files
# Apache Drill needs to be running with the name 'drill'
docker run -it --rm --link drill:drill -v /data:/data \
vemonet/autor2rml \
-j "jdbc:drill:drillbit=drill:31010" -r \
-o "/data/data2services/mapping.trig" \
-d "/data/data2services" \
-b "https://w3id.org/data2services/" \
-g "https://w3id.org/data2services/graph/autor2rml"
# For Postgres, a postgres docker container
# needs to be running with the name 'postgres'
docker run -it --rm --link postgres:postgres -v /data:/data \
vemonet/autor2rml \
-j "jdbc:postgresql://postgres:5432/my_database" -r \
-o "/data/data2services/mapping.trig" \
-u "postgres" -p "pwd" \
-b "https://w3id.org/data2services/" \
-g "https://w3id.org/data2services/graph/autor2rml"
Generate the generic RDF using R2RML and the previously generated mapping.trig file.
docker pull vemonet/r2rml
# Add config.properties file for R2RML in /data/data2services
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
# Run R2RML for Drill (for Postgres, use --link postgres:postgres instead)
docker run -it --rm --link drill:drill \
-v /data/data2services:/data \
vemonet/r2rml /data/config.properties
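The config.properties file shown above can also be generated from a script, which makes it easier to switch the connection URL between Drill and Postgres. This is a minimal sketch under stated assumptions: the function name write_r2rml_config is hypothetical, and only the four properties listed above are written.

```shell
#!/bin/sh
# Hypothetical helper: write config.properties for R2RML into a working directory.
write_r2rml_config() (
  workdir="${1:-/data/data2services}"
  jdbc_url="${2:-jdbc:drill:drillbit=drill:31010}"
  # File paths are written as seen from inside the container,
  # where the working directory is mounted as /data.
  cat > "$workdir/config.properties" <<EOF
connectionURL = $jdbc_url
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
EOF
)
```

For example, write_r2rml_config /data/data2services "jdbc:postgresql://postgres:5432/my_database" writes a Postgres variant. How database credentials are passed to R2RML is not shown in this document, so they are left out of the sketch.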
Finally, use RdfUpload to upload the generated RDF to GraphDB. It can also be done manually using GraphDB server imports for more efficiency on large files.
docker pull vemonet/rdf-upload
docker run -it --rm --link graphdb:graphdb -v /data/data2services:/data \
vemonet/rdf-upload \
-m "HTTP" -if "/data" \
-url "http://graphdb:7200" \
-rep "test" \
-un "import_user" -pw "PASSWORD"
The last step is to transform the generated generic RDF to a particular data model. See the data2services-transform-repository project for examples of transformation to the BioLink model.
We will use the data2services-sparql-operations module to execute multiple SPARQL queries from a GitHub repository, using variables to define the graph URIs.
docker pull vemonet/data2services-sparql-operations
# Load UniProt organisms and Human proteins as BioLink in local endpoint
docker run -d --link graphdb:graphdb \
vemonet/data2services-sparql-operations \
-f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/uniprot" \
-ep "http://graphdb:7200/repositories/test/statements" \
-un MYUSERNAME -pw MYPASSWORD \
--var-output https://w3id.org/data2services/graph/biolink/uniprot
# Load DrugBank xml2rdf generic RDF as BioLink to remote SPARQL endpoint
docker run -d vemonet/data2services-sparql-operations \
-f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/drugbank" \
-ep "http://graphdb.dumontierlab.com/repositories/ncats-red-kg/statements" \
-un USERNAME -pw PASSWORD \
--var-service http://localhost:7200/repositories/test \
--var-input http://data2services/graph/xml2rdf/drugbank \
--var-output https://w3id.org/data2services/graph/biolink/drugbank
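The multi-line docker run commands above rely on backslash line continuations, which Windows PowerShell does not accept. As a rough sketch, the filter below joins continued lines into single-line commands; the function name join_continuations is hypothetical, and it has to run in a POSIX shell (for example Git Bash or WSL) before the result is pasted into PowerShell.

```shell
#!/bin/sh
# Hypothetical helper: join backslash-continued lines read from stdin.
join_continuations() {
  awk '{
    sub(/^[ \t]+/, "")              # drop indentation of continued lines
    if (sub(/[ \t]*\\$/, "")) {     # line ended with a backslash
      printf "%s ", $0              # keep the next line on the same output line
    } else {
      print                         # last line of the command
    }
  }'
}
```

Usage would look like join_continuations < commands.sh, where commands.sh is an assumed file holding the commands copied from this document. Comment lines and lines without a trailing backslash pass through unchanged.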
Remove the \ line continuations and make the docker run command one line for Windows PowerShell.
BETA: RDF validation using ShEx
If you use Data2Services in a scientific publication, you are highly encouraged (not required) to cite the following paper:
Data2Services: enabling automated conversion of data to services. Vincent Emonet, Alexander Malic, Amrapali Zaveri, Andreea Grigoriu and Michel Dumontier.
Bibtex entry:
@inproceedings{Emonet2018,
author = {Emonet, Vincent and Malic, Alexander and Zaveri, Amrapali and Grigoriu, Andreea and Dumontier, Michel},
title = {Data2Services: enabling automated conversion of data to services},
booktitle = {11th Semantic Web Applications and Tools for Healthcare and Life Sciences},
year = {2018}
}