Build a FAIR Knowledge Graph
Vincent Emonet
Data Science engineer at IDS

This tutorial will help you generate a knowledge graph from structured datasets such as CSV, JSON, and XML files. We will use Semantic Web technologies to build our knowledge graph. If you are not familiar with them, we first introduce some basic concepts and technologies behind Semantic Web knowledge graphs.
# Knowledge Graph 101

Back in 1999, the W3C laid down the Semantic Web principles: a utopia where all data published on the web would be described in a standard manner, interconnected, and shared. To make this dream a reality, a standard way to describe and link knowledge graphs was defined: the Resource Description Framework (RDF).
There are multiple definitions of a knowledge graph; most of them come down to nodes connected by edges. The Semantic Web/RDF stack provides standards to build interconnected knowledge graphs at web scale. This article focuses on using those standards.
Multiple concepts have been defined since then, to provide a structured and coherent framework for knowledge representation (this goes further than just building graphs and networks!). Here is a quick glossary of some of the concepts that will be used in this tutorial:
- URI, aka. URL: they are used to identify concepts, properties and entities, e.g. https://w3id.org/mykg/MyConcept
- RDF: the standard knowledge graph format to describe nodes and edges as triples (subject, predicate, object). e.g. "MyConcept is a Restaurant"
- Ontologies, aka. vocabularies: the data models for knowledge graphs, they describe their classes and properties. e.g. the class https://schema.org/Restaurant in Schema.org vocabulary
- SPARQL: the language to query RDF Knowledge Graphs stored in triplestores (database for RDF) through a SPARQL endpoint
- RML: the RDF Mapping Language is a language, based on the R2RML W3C recommendation, to map any structured data (CSV, JSON, XML, SQL) to RDF. Mappings themselves are expressed in RDF.
- YARRRML: a simplification of RML to define mappings in the human-readable YAML format, easier to write than the RML RDF.
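For example, the "MyConcept is a Restaurant" triple mentioned above can be written in the Turtle RDF syntax as follows (a minimal sketch; the `mykg:` namespace is the example URI from the glossary):

```turtle
@prefix mykg: <https://w3id.org/mykg/> .
@prefix schema: <https://schema.org/> .

# subject          predicate   object
mykg:MyConcept     a           schema:Restaurant .
# "a" is the Turtle shorthand for the rdf:type predicate
```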
# Let's get started

This short tutorial aims to build an RDF Knowledge Graph about restaurants and cuisines from two sample tabular files generated from a dataset found on data.world, using mappings expressed in the popular human-readable YAML format.
We will use tools, data models and formats based on recommendations from the World Wide Web Consortium (W3C) to build a knowledge graph as "FAIR" as possible (Findable, Accessible, Interoperable, Reusable).
This guide consists of the following steps:
- Search for relevant ontologies/vocabularies to describe your knowledge graph content.
- Properly combine multiple ontologies, and describe missing properties.
- Map CSV files to a coherent RDF knowledge graph, and generate the RDF knowledge graph.
# Choose a data model

We first need to define a data model for our Knowledge Graph. A knowledge graph data model consists of concepts and properties, defined in an ontology (also known as a vocabulary).
Choosing the right concepts and properties for your Knowledge Graph from existing and recognized ontologies is the most important part of the process to publish data in a standard and reusable manner. If your knowledge graph uses the same ontology (concepts and properties) as other knowledge graphs, it will be easier to connect your graph to existing data published in those knowledge graphs.
Using existing ontologies connects your graph to other data published with the same ontologies. Conversely, if you use your own URIs for properties and entity types, your RDF Knowledge Graph will be isolated, not linked to other data resources, which misses the point of building a knowledge graph.
Let's take a look at the two files we want to integrate into our knowledge graph.

`dataworld-restaurants.csv` has information about restaurants, and which cuisine they serve:
| Restaurant ID | Restaurant Name | City | country name | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | Average Cost for two | Currency | Has Table booking | Has Online delivery | Is delivering now | Switch to order menu | Price range | Aggregate rating | Rating color | Rating text | Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | Amber | New Delhi | India | N-19, Connaught Place, New Delhi | Connaught Place | Connaught Place, New Delhi | 77.220891 | 28.630197 | indian | 1800 | Indian Rupees(Rs.) | Yes | Yes | No | No | 3 | 2.6 | Orange | Average | 152 |
`dataworld-cuisines.csv` is slightly different from the original on data.world, as I ran a quick preprocessing step to concatenate the diet columns and make the mapping easier:
| cuisines | diets |
|---|---|
| Chinese | vegetarian\|low_lactose_diet |
| indian | halal\|low_lactose_diet\|vegetarian |
In this case we followed these steps to define our Knowledge Graph concepts and properties:

- Schema.org is a good vocabulary, built by Google, which covers a large number of concepts and properties. It is the standard vocabulary used by prominent search engines, such as Google Search, to understand metadata. It covers most of the entities and properties we want to create for the columns of the `dataworld-restaurants.csv` file (such as `schema:priceRange`; see the mappings code to discover the complete list of properties used).
- The "Cuisine" concept does not exist in Schema.org: `schema:servesCuisine` points to a plain text value, but in this case we want to create a new entity for cuisines, to better connect them. So we search "food ontology cuisine" on Google to find a more specific ontology about food, and find the `fo:Cuisine` concept from the Food Ontology published by the BBC.
- Unfortunately, there is no property to express that a cuisine is recommended for a specific diet: `schema:suitableForDiet` and `fo:diets` can be used only on recipe entities, so we will need to create a new property to express it, `mykg:recommendedForDiet`.
- Another interesting ontology would have been FoodOn, which is really detailed and well connected to other ontologies, but in our simple use case the Cuisine concept from the BBC Food Ontology is good enough.
Note that multiple ontology repositories can also be consulted to search for relevant concepts and ontologies:
- Linked Open Vocabularies (LOV)
- BioPortal repository for biomedical ontologies
- EBI Ontology Lookup Service, also for biomedical ontologies
- AgroPortal repository for agronomy ontologies
# Build the knowledge graph

We will use the Matey web editor to execute YARRRML mappings that convert our two small CSV files. The operation can be scaled up to larger files by using the RMLMapper Java implementation, or the RMLStreamer for really large files.
- Create two new data sources for the `dataworld-restaurants.csv` and `dataworld-cuisines.csv` files: click on the Input: tab on the left of the Matey web UI to create each data source, and paste in the corresponding CSV content shown above.
- Copy/paste those YARRRML mappings in the Input: YARRRML box in the middle of the Matey web UI; we added comments to explain the different parts of the mappings. A minimal sketch of what they look like is shown below.
Note that we use our own `mykg:` namespace for the URIs of the Knowledge Graph entities (the generated restaurant and cuisine entities), but we use already existing ontology terms for the properties and entity classes.
We used the free w3id.org persistent identifier system to define our `mykg:` URI. We did not create a new entry in w3id.org since this is just an example, but this can easily be done with a simple pull request to their GitHub repository.
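Here is a minimal, hypothetical sketch of what YARRRML mappings for these two files could look like (the `fo:` namespace, the entity URI patterns, and the property choices are illustrative assumptions; the real mappings cover many more columns):

```yaml
prefixes:
  schema: "https://schema.org/"
  fo: "http://purl.org/ontology/fo/"    # assumed namespace for the BBC Food Ontology
  mykg: "https://w3id.org/mykg/"

mappings:
  restaurants:
    sources:
      - ['dataworld-restaurants.csv~csv']
    # One restaurant entity per row, identified by its ID
    s: mykg:restaurant/$(Restaurant ID)
    po:
      - [a, schema:Restaurant]
      - [schema:name, $(Restaurant Name)]
      - [schema:priceRange, $(Price range)]
      # Link each restaurant to a cuisine entity (not just a text value)
      - [fo:cuisine, mykg:cuisine/$(Cuisines)~iri]
  cuisines:
    sources:
      - ['dataworld-cuisines.csv~csv']
    s: mykg:cuisine/$(cuisines)
    po:
      - [a, fo:Cuisine]
      - [schema:name, $(cuisines)]
```

Splitting the pipe-separated diets column into multiple `mykg:recommendedForDiet` triples would additionally require an RML function (see the next steps at the end of this article).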
- Click on the Generate RML button, then on the Generate LD button. The conversion should take around 10 seconds.

After converting the data, the generated RDF Knowledge Graph appears in the Output: Turtle/TriG panel on the right.
- To be more semantically accurate, we need to add a few triples to describe some properties we use that are not described in Schema.org or the Food Ontology.

We need to indicate that entities with the type `schema:Restaurant` are also considered `fo:Restaurant` entities. This improves the knowledge graph semantics, since `fo:cuisine` is not defined as a property of `schema:Restaurant` in the Schema.org vocabulary. We can use the standard `owl:equivalentClass` property to state that the two ontology classes are equivalent.

We also need to define the `mykg:recommendedForDiet` property used on `fo:Cuisine` entities; we define it as an OWL property, with basic metadata about the property's domain and range.

These missing statements can easily be defined as RDF in the Turtle format, and loaded together with the RDF generated for your knowledge graph.
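Here is a minimal sketch of these Turtle statements (the `fo:` namespace and the `schema:RestrictedDiet` range used here are illustrative assumptions):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix fo: <http://purl.org/ontology/fo/> .    # assumed namespace for the BBC Food Ontology
@prefix mykg: <https://w3id.org/mykg/> .

# The two Restaurant classes are equivalent, so fo:cuisine
# can be used on schema:Restaurant entities
schema:Restaurant owl:equivalentClass fo:Restaurant .

# Define the new property with basic metadata, a domain and a range
mykg:recommendedForDiet a owl:ObjectProperty ;
    rdfs:label "recommended for diet" ;
    rdfs:domain fo:Cuisine ;
    rdfs:range schema:RestrictedDiet .    # assumed range for the diet entities
```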
# Publish the RDF

To be able to query your RDF Knowledge Graph, you will need to host it in a triplestore (a database for RDF data). There are a lot of different triplestore solutions out there; most of them expose a SPARQL endpoint, which enables anyone to query your RDF Knowledge Graph using the SPARQL query language.
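For example, once the graph is loaded in a triplestore, a simple SPARQL query could list restaurants together with the cuisines they serve (a sketch assuming the mappings outlined above):

```sparql
PREFIX schema: <https://schema.org/>
PREFIX fo: <http://purl.org/ontology/fo/>

# List each restaurant with the cuisine(s) it serves
SELECT ?restaurantName ?cuisine
WHERE {
    ?restaurant a schema:Restaurant ;
        schema:name ?restaurantName ;
        fo:cuisine ?cuisine .
}
LIMIT 10
```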
- Host your own triplestore: if you have access to a server, and are savvy enough, you can easily deploy a free triplestore such as OpenLink Virtuoso, Blazegraph, Stardog, or Ontotext GraphDB, and load your RDF Knowledge Graph into it.
- Another solution, for small pieces of RDF data, is to publish your RDF as Nanopublications: a decentralized network of triplestores that uses cryptographic keys and ORCID accounts to authenticate publishers. This can be done easily with the nanopub-java library, or the nanopub Python package.
- You can also pay a cloud service provider to publish and expose RDF Knowledge Graphs, such as Amazon Neptune or Dydra.
# Next steps 🔮

Publish the code used to generate the knowledge graph in a public Git repository (e.g. GitHub or GitLab). Provide all relevant information to reproduce the process in the `README.md` file.

You can add SPARQL query examples ⌨️ to help people who want to query your knowledge graph. Ideally, store them in `.rq` files in a GitHub repository to automatically deploy OpenAPIs with a Swagger UI to query your knowledge graph.

You can also create a SHACL or ShEx schema for your knowledge graph; this will allow you to validate the data added to your knowledge graph, to make sure it meets your requirements. Additionally, this provides a precise human- and machine-readable description of the content of your knowledge graph, and the restrictions it complies with. A small SHACL example is sketched below.
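For instance, a minimal SHACL shape (a hypothetical sketch reusing the classes from this tutorial) could check that every restaurant has exactly one name:

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <https://schema.org/> .
@prefix mykg: <https://w3id.org/mykg/> .

# Validate that every schema:Restaurant entity has exactly one string name
mykg:RestaurantShape a sh:NodeShape ;
    sh:targetClass schema:Restaurant ;
    sh:property [
        sh:path schema:name ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```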
Scale up to transform larger files by using the RMLMapper Java implementation, or use the RMLStreamer to stream really large files without being limited by memory.
Create new RML functions to handle more complex use cases, such as performing replacements in strings, or generating URI identifiers from strings.
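As an illustration, existing GREL functions can already be called from YARRRML mappings; the snippet below (adapted from the function example in the YARRRML documentation, and applied to our hypothetical restaurants mapping) would uppercase the restaurant name:

```yaml
po:
  - p: schema:name
    o:
      # Call the GREL toUpperCase function on the Restaurant Name column
      function: grel:toUpperCase
      parameters:
        - [grel:valueParameter, $(Restaurant Name)]
```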
Use w3id.org to define your URIs, then head to the w3id.org GitHub repository to define, in a `.htaccess` file, the redirection that resolves your knowledge graph URIs. The redirection can easily be changed later with a simple pull request if the data location changes. This system ensures that your knowledge graph always uses the same w3id.org URIs, even if the location and resolution system change.
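The `.htaccess` file uses standard Apache rewrite rules; here is a hypothetical sketch (the target URL is a placeholder assumption):

```apache
# Redirect w3id.org/mykg/* requests to wherever the knowledge graph is actually hosted
RewriteEngine on
RewriteRule ^(.*)$ https://example.com/mykg/$1 [R=302,L]
```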
Thanks to 🤝
- The IDLab team at Ghent University for leading the definition of the RML and YARRRML specifications, developing the RML mapper, and deploying the Matey web editor.
- Remzi Çelebi for his help finding the data, and building the knowledge graph.
- The people all over the world who work to define and improve knowledge description standards, usually through the W3C.
Comments are welcome 💬
- Propose improvements to this process
- Point to data sources that could be interesting for another article about building FAIR Knowledge Graphs!
- Ask for specific steps that could be more detailed in another article
Create an issue on GitHub to start the discussion 💬