Open Source Contributions

Hugging Face Hub

  • I’ve created and shared many different zero-shot text classifiers on the Hugging Face Hub. These models can categorize text into any category without needing to be trained on the specific category. They have become popular because they are very efficient and work relatively well out of the box. My models have been downloaded more than 100 million times (Dec. 2021 - Feb. 2025) and I’m one of the top contributors on the HF Hub.
  • Synthetic datasets:
    • The multilingual-NLI-26lang-2mil7 dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset was created by machine-translating popular NLI datasets. I created the dataset to improve the multilingual performance of zero/few-shot classifiers. Interestingly enough, the dataset did not clearly improve performance, probably due to data noise and because models like XLM-R or mDeBERTa already have strong enough multilingual representations from pre-training.
    • I also created datasets like synthetic_zeroshot_mixtral_v0.1 (synthetic NLI data specifically designed for zero-shot classification) or dataset_train_nli (a concatenation of many different classification datasets) to improve the performance of zero-shot text classifiers.
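
To illustrate the general idea behind NLI-based zero-shot classification, here is a minimal sketch. It is not the code of my models: the function names, the hypothesis template and the word-overlap scorer are assumptions made up for this example. A real setup would replace the toy scorer with the entailment probability from a trained NLI model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical template; real classifiers may use different wording.
DEFAULT_TEMPLATE = "This text is about {}."


def build_nli_pairs(
    text: str, labels: List[str], template: str = DEFAULT_TEMPLATE
) -> List[Tuple[str, str]]:
    """Turn each candidate label into a (premise, hypothesis) pair."""
    return [(text, template.format(label)) for label in labels]


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model (assumption for illustration only).

    Returns the share of hypothesis words that also occur in the premise.
    """
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = hypothesis.lower().replace(".", "").split()
    hits = sum(word in premise_words for word in hypothesis_words)
    return hits / len(hypothesis_words)


def classify(
    text: str,
    labels: List[str],
    score_fn: Callable[[str, str], float] = toy_entailment_score,
    template: str = DEFAULT_TEMPLATE,
) -> Tuple[str, Dict[str, float]]:
    """Score every label's hypothesis against the text and pick the best one."""
    pairs = build_nli_pairs(text, labels, template)
    scores = {
        label: score_fn(premise, hypothesis)
        for label, (premise, hypothesis) in zip(labels, pairs)
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


if __name__ == "__main__":
    example = "The parliament passed a new law about politics and taxes."
    label, all_scores = classify(example, ["politics", "sports"])
    print(label, all_scores)
```

Because the labels are only inserted into a hypothesis sentence at inference time, new categories can be added without retraining, which is why NLI data is useful training material for this kind of classifier.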

GitHub

  • The code for reproducing my zero-shot classifiers is fully open-source. I’ve shared the recipe in Jupyter notebooks, which anyone can use to reproduce my models for less than roughly €50.
  • prompt-templates: A simple Python library that facilitates and standardizes the sharing of prompt templates on the Hugging Face Hub.
  • gpt-google-sheets: A small Google Sheets extension for running generative LLMs in Google Sheets cells, written in Google Apps Script (essentially JavaScript).
  • Various work for Hugging Face customers, e.g. litserve-huggingface-ie, which provides guidance for running LitServe containers on HF Inference Endpoints.

Other projects

  • Teaching: I like sharing knowledge. For example, I gave the course “Hands-on Transformers: Fine-Tune your own BERT and GPT” at Hertie School, Berlin. The code, slides and full video are available online.
  • DataMine Europe (2018 - 2021): I co-founded the NGO ‘DataMine Europe’ (initially ‘European Elections Stats’) as a free-time project to share open data on European elections. We produced data and visualizations that were used by leading French, German, Spanish and U.S. media. Our website is now offline to avoid paying roughly €100/year for hosting.
  • CEPS Eurlex dataset: I co-created this dataset of more than 142 000 legal texts with metadata to facilitate computational analyses of the European legal acquis.