Introduction

The Data Science Research Infrastructure (DSRI) is a cluster of servers for deploying data science workspaces and applications.

Workspaces and applications run in Docker containers, which Kubernetes, a container orchestration system, automatically deploys to a powerful server on the cluster. You can then access your workspace or application through an automatically generated URL.

Getting started

✅ What can be done on the DSRI

The DSRI is particularly useful if you need to:

  • Gain access to more computing resources (memory and CPUs), which lets you load larger amounts of data or use more threads for parallelized tasks
  • Run jobs that take a long time to complete
  • Deploy any database or service you need, and connect to it from your workspace easily
  • Book and start a workspace that uses one of our GPUs
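
As a rough illustration of the first point above, the sketch below sizes a worker pool from the CPUs visible to the process and spreads independent tasks across it. The `load_chunk` function and its inputs are hypothetical placeholders, not part of the DSRI. Inside a container, the number of visible CPUs may differ from the CPU limit assigned to your workspace, so treat the detected value as an upper bound rather than a guarantee.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(requested=None):
    """Return a pool size: the requested value capped by the CPUs visible to this process."""
    try:
        visible = len(os.sched_getaffinity(0))  # available on Linux only
    except AttributeError:
        visible = os.cpu_count() or 1
    if requested is None:
        return visible
    return max(1, min(requested, visible))


def load_chunk(chunk_id):
    """Hypothetical task: stands in for reading or processing one piece of data."""
    return chunk_id * chunk_id


def run_parallel(chunk_ids, requested_workers=None):
    """Apply load_chunk to every chunk id with a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=worker_count(requested_workers)) as pool:
        return list(pool.map(load_chunk, chunk_ids))


if __name__ == "__main__":
    print(run_parallel(range(8), requested_workers=4))
```

In Python, threads mainly help with I/O-bound work such as reading files or calling services; for CPU-bound pure-Python code, `concurrent.futures.ProcessPoolExecutor` is usually the better fit.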

The DSRI offers a number of popular workspaces for working with data:

  • Multiple flavors of JupyterLab (scipy, tensorflow, all-spark, and more)
  • Visual Studio Code server (also available within the JupyterLab workspaces)
  • RStudio, with a complementary Shiny server
  • MATLAB
  • Ubuntu Desktop

You can then install anything you want in your workspace using conda, pip, or apt.

Data storage

The DSRI is a computing infrastructure built to run data science workloads. It stores data persistently, but any data stored on the DSRI can be altered by the workloads you run, and we cannot guarantee its immutability.

Always keep a safe copy of your data outside the DSRI, and do not rely on the DSRI for long-term storage.
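
One way to check that an external copy still matches the data in your workspace is to record checksums and compare them later. The following is a minimal sketch using only the Python standard library; the function names and the manifest layout are illustrative assumptions, not a DSRI feature.

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Return relative paths that were added, removed, or modified between two manifests."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))
```

Building a manifest for the workspace copy and another for the external copy, then passing both to `changed_files`, lists any files that differ between the two.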

❌ What cannot be done

  • Since DSRI can only be accessed when using the UM VPN, deployed services will not be available on the public Internet 🔒
  • All activities must have a legal basis. You must closely examine and abide by the terms and conditions of any data, software, or web service that you use as part of your work 📜
  • By default, you cannot reach data or servers hosted at Maastricht University from the DSRI. You will need to request access in advance here 📬️
  • Right now it is not possible to reach the central UM fileservices (MFS)

Request an account

If you work at Maastricht University, see this page to request an account and run your services on the DSRI.

The DSRI architecture

The following diagram gives a simplified view of how the DSRI works, using popular data science applications as examples (JupyterLab, RStudio, VSCode server).

DSRI in a nutshell

The DSRI specifications

Software

We use OKD 4.11, the Origin Community Distribution of Kubernetes that powers Red Hat OpenShift. Kubernetes takes care of deploying the Docker containers on the cluster of servers. OKD extends Kubernetes to improve security and provides a user-friendly web UI to manage your applications.

We use Red Hat Ceph Storage for distributed storage.

Hardware

  • 16 CPU nodes

                     RAM (GB)    CPU (cores)               Storage (TB)
    Node capacity    512 GB      64 cores (128 threads)    120 TB
    Total capacity   8 192 GB    1 024 cores               1 920 TB

  • 1 GPU node: Nvidia DGX-1 with 8x Tesla V100 32 GB GPUs

                         GPUs    RAM (GB)    CPU (cores)
    GPU node capacity    8       512 GB      40 cores

DSRI infrastructure
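
The total capacity figures for the CPU nodes are the per-node figures multiplied by the number of nodes. The short sketch below reproduces that arithmetic; the `NodeSpec` class and `total_capacity` function are illustrative names introduced here, and only the numbers come from the hardware listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Capacity figures for one node (or for a group of nodes)."""
    ram_gb: int
    cpu_cores: int
    storage_tb: int


def total_capacity(node, node_count):
    """Scale a single node's capacity by the number of identical nodes."""
    return NodeSpec(
        ram_gb=node.ram_gb * node_count,
        cpu_cores=node.cpu_cores * node_count,
        storage_tb=node.storage_tb * node_count,
    )


# Per-node figures and node count taken from the CPU node listing.
CPU_NODE = NodeSpec(ram_gb=512, cpu_cores=64, storage_tb=120)
CPU_NODE_COUNT = 16

if __name__ == "__main__":
    print(total_capacity(CPU_NODE, CPU_NODE_COUNT))
    # prints: NodeSpec(ram_gb=8192, cpu_cores=1024, storage_tb=1920)
```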

Learn more about DSRI

See the following presentation about the Data Science Research Infrastructure:

DSRI April 2021 Community Event Presentation