Open Data Science: Text and Data Analytical Workflows for Data-Driven Businesses and Civil Society



Machine Learning and Data Science are powerful instruments, but when applying them we need to avoid the "Type III Error": answering the wrong question, or solving the wrong problem. For me, that means developing tools in close cooperation with the stakeholders involved, thoroughly assessing their needs, and then homing in on the issue that code and data can really solve.

I enjoy learning new things and applying them right away, so my solutions are guided by the newest scientific advances, empirical data, and current best practices.

My aim is to develop software as a tool to empower humans with the knowledge they need. This also means working as openly as possible, and making code, data and processes re-usable, adaptable and shareable without barriers.

During my studies I recognized the need for increasing public access to knowledge, and the complementary need for tools that scale with the increasing amount of knowledge produced. Around 2013 I joined the Open Science Movement, where both issues are being tackled - on the one hand by advocating for Open Access and Open (Research) Data, on the other hand by creating an open ecosystem of tools and services.


Machine Learning | Deep Learning
Natural Language Processing
Graph Mining


Languages: Python | R
Frameworks: Apache Spark | Keras + TensorFlow


I enjoy developing data science workflows for research projects and non-governmental/non-profit organizations. If you have a specific use case or simply want to discuss an idea, drop me a line!

Open Knowledge Maps

A visual interface to the world's scientific knowledge

Open Knowledge Maps is creating a visual interface to the world's scientific knowledge that anyone can use to dramatically improve the discoverability of research results.

At Open Knowledge Maps I work on the algorithms that cluster and summarize the search results.

Try it out!


Related terms to cat: tgg, ggg, ggc, ttg, gac

Embedbot is the result of playing around with Apache Spark, word2vec and the Twitter API.

Embedbot replies to queries like "climate + change" with related terms from word embeddings trained on the EuropePMC Open Access corpus of 1.4 million scientific papers.
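To illustrate how a query such as "climate + change" can be answered from word embeddings, here is a minimal sketch in plain Python. It is not Embedbot's actual code: the three-dimensional vectors are invented toy values rather than embeddings from the EuropePMC corpus, and the helper names `cosine` and `related_terms` are hypothetical. The idea is simply to add the vectors of the query words and rank all other words by cosine similarity to that sum.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Real word2vec embeddings typically have hundreds of dimensions.
TOY_VECTORS = {
    "climate": [0.9, 0.1, 0.0],
    "change": [0.1, 0.9, 0.0],
    "warming": [0.7, 0.6, 0.1],
    "protein": [0.0, 0.1, 0.9],
    "gene": [0.1, 0.0, 0.8],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_terms(query, vectors, top_n=3):
    """Answer a query like 'climate + change' with the closest other words."""
    words = [w.strip() for w in query.split("+")]
    dims = len(next(iter(vectors.values())))

    # Add up the vectors of all query words (raises KeyError for unknown words).
    target = [0.0] * dims
    for word in words:
        for i, value in enumerate(vectors[word]):
            target[i] += value

    # Rank every non-query word by similarity to the summed vector.
    candidates = [
        (term, cosine(target, vec))
        for term, vec in vectors.items()
        if term not in words
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in candidates[:top_n]]


if __name__ == "__main__":
    print("Related terms to climate + change:",
          ", ".join(related_terms("climate + change", TOY_VECTORS)))
```

With these toy values, "warming" ranks first because its vector points in nearly the same direction as the sum of "climate" and "change".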

Ask it something!


A workflow for a Digital Literary Network Analysis

Working group members are Frank Fischer, Mathias Göbel, Dario Kampkaspar, Hanna-Lena Meiners, Danil Skorinkin, and Peer Trilcke. We're looking into hundreds of dramatic texts ranging from Greek tragedies to 20th-century plays, and working on larger German, French, English, and Russian corpora.

In dlina I work on a tool for network visualizations and network metrics.
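As a rough illustration of what network metrics for a play might look like, here is a minimal sketch in plain Python. It rests on an assumption not stated above: that two characters are linked whenever they appear in the same scene. The scene data, the character labels "A" to "D", and the function names are all hypothetical, and this is not the dlina tool itself.

```python
from itertools import combinations


def build_network(scenes):
    """Link every pair of characters that share at least one scene."""
    edges = set()
    for scene in scenes:
        # Sorting gives each undirected edge one canonical (a, b) form.
        for a, b in combinations(sorted(set(scene)), 2):
            edges.add((a, b))
    return edges


def degrees(edges):
    """Number of distinct co-occurring characters per character."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


def density(edges, n_nodes):
    """Share of all possible character pairs that are actually linked."""
    if n_nodes < 2:
        return 0.0
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


if __name__ == "__main__":
    # Invented scenes with placeholder character labels.
    scenes = [["A", "B"], ["A", "B", "C"], ["C", "D"]]
    edges = build_network(scenes)
    deg = degrees(edges)
    print("Degrees:", sorted(deg.items()))
    print("Density:", round(density(edges, len(deg)), 3))
```

Note that characters who never share a scene with anyone do not appear in the degree table, so the node count passed to `density` would need to come from the cast list in that case.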

Click here to learn more.


The Right To Read Is The Right To Mine

ContentMine develops open source software for mining the scientific literature and works directly with researchers to support their use of mining, saving valuable time and opening up new research avenues.

At ContentMine I was responsible for downstream data analysis and visualizations for demonstration purposes.

Click here to learn more. Try a demo here!