data2services-pipeline

DEPRECATED. See http://d2s.semanticscience.org

Get started

This is a demonstrator ETL pipeline that converts relational databases, tabular files, and XML files to RDF. It first generates a generic RDF representation that mirrors the structure of the input data. The user then writes SPARQL queries to map this generic RDF to a specific target model.


The Data2Services philosophy

Each step runs as a Docker container configured with a few parameters (e.g. input file path, SPARQL endpoint, credentials, mapping file path).


Clone

git clone --recursive https://github.com/MaastrichtU-IDS/data2services-pipeline.git
cd data2services-pipeline

# Update all submodules
git submodule update --recursive --remote

Build

build.sh is a convenience script that builds or pulls all Docker images. Each image can also be built separately.

Build/pull all Docker images:

# Don't forget to put GraphDB zip file in the graphdb folder
./build.sh

Start services

In a production environment, we assume that both the Apache Drill and GraphDB services are already running. Other RDF triple stores should also work, but have not been tested yet.

# Pull and start apache-drill
docker pull vemonet/apache-drill
docker run -dit --rm -p 8047:8047 -p 31010:31010 \
  --name drill -v /data:/data:ro \
  vemonet/apache-drill
# Build and start graphdb (don't forget to put the .zip file in the graphdb folder)
docker build -t graphdb ./graphdb
docker run -d --rm --name graphdb -p 7200:7200 \
  -v /data/graphdb:/opt/graphdb/home \
  -v /data/graphdb-import:/root/graphdb-import \
  graphdb
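Both containers start in the background, so later steps can fail if they run before the services accept connections. The sketch below polls an HTTP endpoint until it responds. It is not part of the pipeline: `wait_for_url` is a hypothetical helper, it assumes `curl` is installed on the host, and it assumes the Drill web console (port 8047) and GraphDB (port 7200) answer plain HTTP requests on the ports published above.

```shell
#!/bin/sh
# Hypothetical helper (not part of data2services): poll a URL until it answers.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors (status >= 400)
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example calls, assuming the port mappings used above:
# wait_for_url "http://localhost:8047" && echo "Drill is up"
# wait_for_url "http://localhost:7200" && echo "GraphDB is up"
```

If GraphDB security is enabled the root URL may answer with an error status, in which case a different URL or check would be needed.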

Run using Docker commands

Download datasets

Source files can be downloaded automatically using shell scripts. See the data2services-download module for more details.

docker pull vemonet/data2services-download
docker run -it --rm -v /data/data2services:/data \
  vemonet/data2services-download \
  --download-datasets drugbank,hgnc,date \
  --username my_login --password my_password \
  --clean # to delete all files in /data/data2services

Convert XML

Use xml2rdf to convert XML files to a generic RDF based on the file structure.

docker pull vemonet/xml2rdf
docker run --rm -it -v /data:/data \
  vemonet/xml2rdf  \
  -i "/data/data2services/myfile.xml.gz" \
  -o "/data/data2services/myfile.nq.gz" \
  -g "https://w3id.org/data2services/graph/xml2rdf"

Generate R2RML mapping file for TSV & RDB

We use AutoR2RML to generate the R2RML mapping file that converts relational databases (Postgres, SQLite, MariaDB) and CSV, TSV and PSV files to a generic RDF representing the input data structure. See the Wiki for other DBMSs and how to deploy databases.

docker pull vemonet/autor2rml
# For CSV, TSV, PSV files
# Apache Drill needs to be running with the name 'drill'
docker run -it --rm --link drill:drill -v /data:/data \
  vemonet/autor2rml \
  -j "jdbc:drill:drillbit=drill:31010" -r \
  -o "/data/data2services/mapping.trig" \
  -d "/data/data2services" \
  -b "https://w3id.org/data2services/" \
  -g "https://w3id.org/data2services/graph/autor2rml"
# For Postgres, a postgres docker container 
# needs to be running with the name 'postgres'
docker run -it --rm --link postgres:postgres -v /data:/data \
  vemonet/autor2rml \
  -j "jdbc:postgresql://postgres:5432/my_database" -r \
  -o "/data/data2services/mapping.trig" \
  -u "postgres" -p "pwd" \
  -b "https://w3id.org/data2services/" \
  -g "https://w3id.org/data2services/graph/autor2rml"

Use R2RML mapping file to generate RDF

Generate the generic RDF using R2RML and the previously generated mapping.trig file.

docker pull vemonet/r2rml
# Create the file /data/data2services/config.properties for R2RML with the following content:
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS

# Run R2RML for Drill (for Postgres, use --link postgres:postgres instead)
docker run -it --rm --link drill:drill \
  -v /data/data2services:/data \
  vemonet/r2rml /data/config.properties
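The config.properties file can also be written from the shell with a here-document, as in the minimal sketch below. The `D2S_DIR` variable and its local default are assumptions made so the snippet runs without root access; the commands above use /data/data2services as the host directory. The paths inside the file are container paths: the host directory is mounted as /data, so the mapping.trig written by AutoR2RML to /data/data2services/mapping.trig appears as /data/mapping.trig.

```shell
#!/bin/sh
# Host directory that will be mounted as /data in the r2rml container.
# Assumption: defaults to a local folder; set D2S_DIR=/data/data2services
# to match the docker commands in this README.
D2S_DIR="${D2S_DIR:-$PWD/data2services}"
mkdir -p "$D2S_DIR"

# Same four properties as shown above, for the Apache Drill case.
cat > "$D2S_DIR/config.properties" <<'EOF'
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
EOF

echo "Wrote $D2S_DIR/config.properties"
```

For Postgres, connectionURL would presumably be the JDBC URL passed to AutoR2RML (jdbc:postgresql://postgres:5432/my_database). How credentials are supplied is not covered here.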

Upload RDF

Finally, use RdfUpload to upload the generated RDF to GraphDB. For large files, it is more efficient to import the files manually using GraphDB server imports.

docker pull vemonet/rdf-upload
docker run -it --rm --link graphdb:graphdb -v /data/data2services:/data \
  vemonet/rdf-upload \
  -m "HTTP" -if "/data" \
  -url "http://graphdb:7200" \
  -rep "test" \
  -un "import_user" -pw "PASSWORD"

Transform generic RDF to target model

The last step is to transform the generated generic RDF to a particular data model. See the data2services-transform-repository project for examples of transformation to the BioLink model.

We will use the data2services-sparql-operations module to execute multiple SPARQL queries from a GitHub repository, using variables to define the graph URIs.

docker pull vemonet/data2services-sparql-operations

# Load UniProt organisms and Human proteins as BioLink in local endpoint
docker run -d --link graphdb:graphdb \
  vemonet/data2services-sparql-operations \
  -f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/uniprot" \
  -ep "http://graphdb:7200/repositories/test/statements" \
  -un MYUSERNAME -pw MYPASSWORD \
  --var-output https://w3id.org/data2services/graph/biolink/uniprot

# Load DrugBank xml2rdf generic RDF as BioLink to remote SPARQL endpoint
docker run -d vemonet/data2services-sparql-operations \
  -f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/drugbank" \
  -ep "http://graphdb.dumontierlab.com/repositories/ncats-red-kg/statements" \
  -un USERNAME -pw PASSWORD \
  --var-service http://localhost:7200/repositories/test \
  --var-input http://data2services/graph/xml2rdf/drugbank \
  --var-output https://w3id.org/data2services/graph/biolink/drugbank
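To illustrate what defining graph URIs through variables means, the sketch below substitutes placeholders in a SPARQL update template with the values passed as --var-input and --var-output in the second command. It only mimics the idea with sed: the `<?_input>` and `<?_output>` placeholder syntax and the copy-everything query are assumptions for illustration, not the module's actual implementation or one of the repository's queries.

```shell
#!/bin/sh
# Hypothetical query template; placeholder syntax is assumed for illustration.
QUERY_TEMPLATE='INSERT { GRAPH <?_output> { ?s ?p ?o } }
WHERE { GRAPH <?_input> { ?s ?p ?o } }'

# Graph URIs taken from the DrugBank command above.
INPUT_GRAPH="http://data2services/graph/xml2rdf/drugbank"
OUTPUT_GRAPH="https://w3id.org/data2services/graph/biolink/drugbank"

# Replace each placeholder with its URI ('|' is used as the sed delimiter
# because the URIs contain '/').
QUERY=$(printf '%s\n' "$QUERY_TEMPLATE" \
  | sed -e "s|?_input|$INPUT_GRAPH|g" -e "s|?_output|$OUTPUT_GRAPH|g")

printf '%s\n' "$QUERY"
```

Keeping the graph URIs out of the query files lets the same set of queries be reused for different source and target graphs.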

Further documentation in the Wiki


Citing this work

If you use Data2Services in a scientific publication, you are highly encouraged (not required) to cite the following paper:

Data2Services: enabling automated conversion of data to services. Vincent Emonet, Alexander Malic, Amrapali Zaveri, Andreea Grigoriu and Michel Dumontier.

Bibtex entry:

@inproceedings{Emonet2018,
author = {Emonet, Vincent and Malic, Alexander and Zaveri, Amrapali and Grigoriu, Andreea and Dumontier, Michel},
title = {Data2Services: enabling automated conversion of data to services},
booktitle = {11th Semantic Web Applications and Tools for Healthcare and Life Sciences},
year = {2018}
}