Data & Software Releases

As part of GoURMET’s efforts to increase the resources and tools available
for low-resource machine translation, we have released many of the
corpora and much of the software created during the project. The corpora
are also available at OPUS (http://opus.nlpl.eu/GoURMET.php).
Corpora
English–Swahili parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip
English–Turkish parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-tr.zip
English–Amharic parallel corpus and Amharic monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-am.zip
English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ky.zip
Kyrgyz–Russian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ky-ru.zip
PMIndia – Parallel corpus of languages of India
http://data.statmt.org/pmindia/
English–Serbian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.sr-en.zip
English–Serbo-Croatian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.hbs-en.zip
Monolingual News Crawl
http://data.statmt.org/news-crawl
OPUS-100 corpus
(An English-centric multilingual corpus covering 100 languages sampled from OPUS. All training pairs include English on either the source or target side.)
https://github.com/EdinburghNLP/opus-100-corpus
Translation Models
The repository currently covers seven languages: Bulgarian, Turkish, Swahili, Kyrgyz, Serbian, Amharic and Gujarati.
http://data.statmt.org/gourmet/models/
Amharic-English
http://data.statmt.org/gourmet/models/am-en/20200630/mt-engine-am-en.tgz
Bulgarian-English
http://data.statmt.org/gourmet/models/bg-en/
English-Bulgarian
http://data.statmt.org/gourmet/models/en-bg/
Gujarati-English
http://data.statmt.org/gourmet/models/gu-en/20190628/
English-Gujarati
http://data.statmt.org/gourmet/models/en-gu/
Serbian-English
http://data.statmt.org/gourmet/models/sr-en/20200411/
English-Serbian (Cyr.)
http://data.statmt.org/gourmet/models/en-sr.cyr/20200411/
English-Serbian (Lat.)
http://data.statmt.org/gourmet/models/en-sr.lat/20200411/
Turkish-English
http://data.statmt.org/gourmet/models/tr-en/20200630/mt-engine-tr-en.tgz
English-Turkish
http://data.statmt.org/gourmet/models/en-tr/
Various Docker Modules
http://data.statmt.org/gourmet/models/docker/
Software
mBART pretraining of Marian models
https://github.com/transducens/smart-segmentation
Morphological segmentation using Apertium resources
https://github.com/transducens/smart-segmentation
Hierarchical decoding (word RNN and decoding of words character by character)
https://github.com/d-ataman/Char-NMT
Latent modelling of morphology in NMT
https://github.com/d-ataman/lmm
Neural n-to-m alignment models
https://github.com/Roxot/m-to-n-alignments
Deep latent variable models for language modelling
https://github.com/tom-pelsmaeker/deep-generative-lm
Interpretable text classifiers with sparse relaxations to discrete random variables
https://github.com/bastings/interpretable_predictions
Training data for document-level machine translation
https://github.com/radidd/Doc-substructure-NMT
Contrastive test sets for document-level machine translation
https://github.com/rbawden/Large-contrastive-pronoun-testset-EN-FR
Auto-encoding variational neural machine translation
https://github.com/Roxot/AEVNMT.pt
Bayesian data analysis of NMT models
https://github.com/probabll/bda-nmt
LASER train (language-agnostic sentence embeddings)
https://github.com/transducens/LASERtrain
LinguaCrawl: Top-level-domain crawler
https://github.com/transducens/linguacrawl/
WMT19 Gujarati system models and scripts
http://data.statmt.org/wmt19_systems/
Tool for fusing, extending and using language representations
https://github.com/aoncevay/multiview-langrep
Code for the improving massively multilingual NMT work
https://github.com/bzhangGo/zero
Code for the language model prior work
https://github.com/cbaziotis/lm-prior-for-nmt
Code for the auto-encoding variational NMT work
https://github.com/Roxot/AEVNMT
Bitextor: crawling of parallel corpora from the web
https://github.com/bitextor/bitextor
Bicleaner: detecting noisy sentence pairs in a parallel corpus
https://github.com/bitextor/bicleaner
Constrained optimisation for deep generative models in torch
https://github.com/EelcovdW/pytorch-constrained-opt.git
Probability distributions for torch including sparse relaxations to discrete random variables
https://github.com/probabll/dists.pt
Probabilistic modules for torch
https://github.com/probabll/dgm.pt
Evaluation Tools
Direct Assessment – Sentence Pairs Evaluation Tool
https://github.com/bbc/gourmet-sentence-pairs-evaluation
The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine-translated sentence to that of a human-translated sentence, where the human translation is assumed to be the gold standard.
Gap Fill Evaluation Tool
https://github.com/bbc/gourmet-gap-fill-evaluation
The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a human-translated sentence, using the machine translation of the same sentence as a guide to which words belong in the gaps.
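The Gap Fill idea can be illustrated with a minimal sketch. This is not the code of the tool linked above. The names `GapFillItem`, `make_gap_fill_item` and `score_fills`, the rule of gapping every fourth word, and the exact-match scoring are all assumptions made for illustration; the actual tool may select gaps and score answers differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GapFillItem:
    """One evaluation item (hypothetical structure, for illustration only)."""
    gapped: str         # human translation with gaps shown as "____"
    answers: List[str]  # the removed words, in order of appearance


def make_gap_fill_item(reference: str, every: int = 4) -> GapFillItem:
    """Blank out every `every`-th whitespace-separated word of the human translation.

    Assumption: a fixed stride is used here so the example is deterministic.
    """
    answers: List[str] = []
    shown: List[str] = []
    for position, token in enumerate(reference.split(), start=1):
        if position % every == 0:
            answers.append(token)
            shown.append("____")
        else:
            shown.append(token)
    return GapFillItem(gapped=" ".join(shown), answers=answers)


def score_fills(item: GapFillItem, fills: List[str]) -> float:
    """Fraction of gaps the annotator filled with the original word (case-insensitive)."""
    if not item.answers:
        return 0.0
    correct = sum(
        fill.strip().lower() == answer.strip().lower()
        for fill, answer in zip(fills, item.answers)
    )
    return correct / len(item.answers)


# The annotator would see `item.gapped` next to the machine translation
# of the same source sentence and type one word per gap.
item = make_gap_fill_item("The president visited the flooded region on Monday")
score = score_fills(item, ["the", "Tuesday"])
```

Under these assumptions, a higher average score across annotators suggests that the machine translation carried enough information to recover the missing words.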