Data & Software Releases

As part of GoURMET's effort to increase the resources and tools available for low-resource machine translation, we have released many of the corpora and much of the software created during the project.

Corpora

English–Swahili parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip

English–Turkish parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-tr.zip

English–Amharic parallel corpus and Amharic monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-am.zip

English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ky.zip

Kyrgyz–Russian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ky-ru.zip

PMIndia – Parallel corpus of languages of India
http://data.statmt.org/pmindia/

English–Serbian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.sr-en.zip

English–Serbo-Croatian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.hbs-en.zip

Monolingual News Crawl
http://data.statmt.org/news-crawl

OPUS-100 corpus
(An English-centric multilingual corpus covering 100 languages sampled from OPUS. All training pairs include English on either the source or target side.)
https://github.com/EdinburghNLP/opus-100-corpus

Translation Models

The repository currently covers seven languages: Bulgarian, Turkish, Swahili, Kyrgyz, Serbian, Amharic and Gujarati.
http://data.statmt.org/gourmet/models/

Amharic-English
http://data.statmt.org/gourmet/models/am-en/20200630/mt-engine-am-en.tgz

Bulgarian-English
http://data.statmt.org/gourmet/models/bg-en/

English-Bulgarian
http://data.statmt.org/gourmet/models/en-bg/

Gujarati-English
http://data.statmt.org/gourmet/models/gu-en/20190628/

English-Gujarati
http://data.statmt.org/gourmet/models/en-gu/

Serbian-English
http://data.statmt.org/gourmet/models/sr-en/20200411/

English-Serbian (Cyr.)
http://data.statmt.org/gourmet/models/en-sr.cyr/20200411/

English-Serbian (Lat.)
http://data.statmt.org/gourmet/models/en-sr.lat/20200411/

Turkish-English
http://data.statmt.org/gourmet/models/tr-en/20200630/mt-engine-tr-en.tgz

English-Turkish
http://data.statmt.org/gourmet/models/en-tr/

Various Docker Modules
http://data.statmt.org/gourmet/models/docker/

Software

Morphological segmentation using Apertium resources
https://github.com/transducens/smart-segmentation

Hierarchical decoding (word RNN and decoding of words character by character)
https://github.com/d-ataman/Char-NMT

Latent modelling of morphology in NMT
https://github.com/d-ataman/lmm

Neural n-to-m alignment models
https://github.com/Roxot/m-to-n-alignments

Deep latent variable models for language modelling
https://github.com/tom-pelsmaeker/deep-generative-lm

Interpretable text classifiers with sparse relaxations to discrete random variables
https://github.com/bastings/interpretable_predictions

Training data for document-level machine translation
https://github.com/radidd/Doc-substructure-NMT

Contrastive test sets for document-level machine translation
https://github.com/rbawden/Large-contrastive-pronoun-testset-EN-FR

Auto-encoding variational neural machine translation
https://github.com/Roxot/AEVNMT.pt

Bayesian data analysis of NMT models
https://github.com/probabll/bda-nmt

LASER train (language-agnostic sentence embeddings)
https://github.com/transducens/LASERtrain

LinguaCrawl: Top-level-domain crawler
https://github.com/transducens/linguacrawl/

WMT19 Gujarati system models and scripts
http://data.statmt.org/wmt19_systems/

Tool for fusing, extending and using language representations
https://github.com/aoncevay/multiview-langrep

Code for the improving massively multilingual NMT work
https://github.com/bzhangGo/zero

Code for the language model prior work
https://github.com/cbaziotis/lm-prior-for-nmt

Code for the auto-encoding variational NMT work
https://github.com/Roxot/AEVNMT

Bitextor: crawling of parallel corpora from the web
https://github.com/bitextor/bitextor

Bicleaner: detecting noisy sentence pairs in a parallel corpus
https://github.com/bitextor/bicleaner

Constrained optimisation for deep generative models in torch
https://github.com/EelcovdW/pytorch-constrained-opt.git

Probability distributions for torch including sparse relaxations to discrete random variables
https://github.com/probabll/dists.pt

Probabilistic modules for torch
https://github.com/probabll/dgm.pt

Evaluation Tools

Direct Assessment Sentence Pairs Evaluation Tool
https://github.com/bbc/gourmet-sentence-pairs-evaluation

The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine-translated sentence with that of a human-translated sentence, where the human translation is assumed to be the gold standard.

Gap Fill Evaluation Tool
https://github.com/bbc/gourmet-gap-fill-evaluation

The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a human-translated sentence, using the machine translation of the same sentence as a guide to which words belong in the gaps.
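
The Gap Fill idea can be illustrated with a minimal sketch. This is not the code of the BBC tool linked above. The function names (make_gap_fill_item, score_gap_fill), the random choice of gaps, the gap ratio, and the exact-match scoring are all assumptions made for illustration only.

```python
import random

GAP = "____"  # placeholder shown to the evaluator (hypothetical choice)


def make_gap_fill_item(reference, gap_ratio=0.2, seed=0):
    """Blank out a random subset of words in a human reference translation.

    Returns the gapped sentence and the list of removed words, in order.
    Random gap selection is an assumption; a real tool may choose gaps differently.
    """
    words = reference.split()
    if not words:
        raise ValueError("reference sentence is empty")
    rng = random.Random(seed)
    n_gaps = min(len(words), max(1, round(len(words) * gap_ratio)))
    gap_positions = set(rng.sample(range(len(words)), n_gaps))
    answers = [w for i, w in enumerate(words) if i in gap_positions]
    prompt = [GAP if i in gap_positions else w for i, w in enumerate(words)]
    return " ".join(prompt), answers


def score_gap_fill(answers, responses):
    """Fraction of gaps filled with the reference word (case-insensitive exact match)."""
    if len(answers) != len(responses):
        raise ValueError("one response is required per gap")
    if not answers:
        raise ValueError("no gaps to score")
    correct = sum(
        a.strip().lower() == r.strip().lower() for a, r in zip(answers, responses)
    )
    return correct / len(answers)
```

In use, the evaluator would see the gapped human translation next to the machine translation of the same source sentence, and the responses they type would be passed to score_gap_fill. A higher score suggests the machine translation carried enough information to recover the missing words.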
