Data & Software Releases

As part of GoURMET’s efforts to increase the resources and tools available
for low-resource machine translation, we have released many of the
corpora and software tools created during the project. These corpora are
also available at OPUS.


English–Swahili parallel corpus

English–Turkish parallel corpus

English–Amharic parallel corpus and Amharic monolingual corpus

English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus

Kyrgyz–Russian parallel corpus

PMIndia – Parallel corpus of languages of India

English–Serbian parallel corpus

English–Serbo-Croatian parallel corpus

Monolingual News Crawl

OPUS-100 corpus
(An English-centric multilingual corpus covering 100 languages sampled from OPUS. All training pairs include English on either the source or target side.)

Translation Models

The repository currently covers seven languages: Bulgarian, Turkish, Swahili, Kyrgyz, Serbian, Amharic and Gujarati.







English–Serbian (Cyr.)

English–Serbian (Lat.)



Various Docker Modules


mBART pretraining of Marian models

Morphological segmentation using Apertium resources

Hierarchical decoding (word RNN and decoding of words character by character)

Latent modelling of morphology in NMT

Neural n-to-m alignment models

Deep latent variable models for language modelling

Interpretable text classifiers with sparse relaxations to discrete random variables

Training data for document-level machine translation

Contrastive test sets for document-level machine translation

Auto-encoding variational neural machine translation

Bayesian data analysis of NMT models

LASER train (language-agnostic sentence embeddings)

LinguaCrawl: Top-level-domain crawler

WMT19 Gujarati system models and scripts

Tool for fusing, extending and using language representations

Code for the work on improving massively multilingual NMT

Code for the work on language model priors

Code for the work on auto-encoding variational NMT

Bitextor: crawling of parallel corpora from the web

Bicleaner: detecting noisy sentence pairs in a parallel corpus

Constrained optimisation for deep generative models in torch

Probability distributions for torch including sparse relaxations to discrete random variables

Probabilistic modules for torch

Evaluation Tools

Direct Assessment Sentence Pairs Evaluation Tool

The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine-translated sentence with a human-translated sentence, where the human translation is assumed to be the gold standard.

Gap Fill Evaluation Tool

The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a human-translated sentence, using the machine translation of the same sentence as a guide to which words belong in the gaps.
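The gap-fill idea can be illustrated with a minimal sketch. This is not the released tool's implementation: the function names `make_gap_item` and `score_fills`, the whitespace tokenisation, the `____` placeholder, and the exact-match scoring rule are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

GAP = "____"  # hypothetical placeholder shown to the annotator


def make_gap_item(reference: str, gap_positions: Sequence[int]) -> Tuple[str, List[str]]:
    """Blank out the tokens at gap_positions in a human translation.

    Returns the gapped sentence and the list of removed words, in order.
    Tokenisation is plain whitespace splitting (an assumption).
    """
    tokens = reference.split()
    positions = sorted(set(gap_positions))
    answers = [tokens[i] for i in positions]
    for i in positions:
        tokens[i] = GAP
    return " ".join(tokens), answers


def score_fills(answers: Sequence[str], fills: Sequence[str]) -> float:
    """Fraction of gaps where the annotator's word matches the removed word.

    Matching is case-insensitive exact match (an assumed scoring rule).
    """
    if len(answers) != len(fills):
        raise ValueError("one fill is required per gap")
    if not answers:
        return 0.0
    correct = sum(a.lower() == f.strip().lower() for a, f in zip(answers, fills))
    return correct / len(answers)


# Invented example sentence; the annotator would see `gapped` together with
# the machine translation of the same source sentence as a hint.
reference = "The minister opened the new hospital on Monday"
gapped, answers = make_gap_item(reference, [1, 5])
fills = ["minister", "clinic"]  # words supplied by a hypothetical annotator
accuracy = score_fills(answers, fills)
```

Under these assumptions, a machine translation that conveys more of the original meaning should let annotators recover more of the removed words, so a higher score suggests a more useful translation.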
