Data, Model & Software Releases

As part of GoURMET’s efforts to increase resources and tools available for low-resource machine translation, we have released many of the corpora, models and software created during the project. These corpora are also available at OPUS (http://opus.nlpl.eu/GoURMET.php) and Translation Models (including a “how to use them”) are available here on GitHub.

Table of Contents

Corpora

English–Swahili parallel corpus and Swahili monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip

English–Turkish parallel corpus and Turkish monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-tr.zip

English–Amharic parallel corpus and Amharic monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-am.zip

English–Kyrgyz parallel corpus and Kygryz monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ky.zip

Kyrgyz–Russian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ky-ru.zip

English–Serbian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.sr-en.zip

English–Serbo-Croatian parallel corpora
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.hbs-en.zip

Parallel and monolingual corpora of languages of India
http://data.statmt.org/pmindia/

Monolingual News Crawl
http://data.statmt.org/news-crawl

English–Macedonian parallel corpus and Macedonian monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-mk.zip

English–Yoruba parallel corpus and Yoruba monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-yo.zip

English–Burmese parallel corpus and Burmese monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-my.zip

English–Pastho parallel corpus and Pastho monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ps.zip

English–Igbo parallel corpus and Igbo monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ig.zip

English–Hausa parallel corpus and Hausa monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ha.zip

Tigrinya monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ti.zip

OPUS-100 corpus
(An English-centric multilingual corpus covering 100 languages sampled from OPUS. All training pairs include English on either the source or target side.)
https://github.com/EdinburghNLP/opus-100-corpus

Translation Models

The repository covers seventeen languages: Bulgarian, Gujarati, Swahili Turkish, Tamil, Serbian, Hausa, Igbo, Pashto, Turkish, Amharic, Kyrgyz, Macedonian, Urdu, Myanmar and Yoruba. Get the list here:
http://data.statmt.org/gourmet/models/ and https://github.com/EdinburghNLP/gourmet-models


Bulgarian to English	0.1	http://data.statmt.org/gourmet/models/docker/bg-en.20190801.tgz
English to Bulgarian	0.2	http://data.statmt.org/gourmet/models/docker/en-bg.v0.2.tgz
Gujarati to English	0.1	http://data.statmt.org/gourmet/models/docker/gu-en.20190628.tgz
English to Gujarati	0.2	http://data.statmt.org/gourmet/models/docker/en-gu.v0.2.tgz
Swahili to English	0.5	https://data.statmt.org/gourmet/models/docker/translation-sw-en-0-5-0.docker.gz
English to Swahili	0.5	https://data.statmt.org/gourmet/models/docker/translation-en-sw-0-5-0.docker.gz
Tamil to English	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-ta-en.tar.gz
English to Tamil	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-en-ta.tar.gz
Serbian to English	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-sr-en.tar.gz
English to Serbian (Cyrilic)	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-en-sr.cyr.tar.gz
English to Serbian (Latin)	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-en-sr.lat.tar.gz
Hausa to English	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-ha-en.v0.2.tar.gz
English to Hausa	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-en-ha.v0.2.tar.gz
Igbo to English	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-ig-en.v0.2.tar.gz
English to Igbo	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-en-ig.v0.2.tar.gz
Pashto to English	0.4.1	https://data.statmt.org/gourmet/models/docker/translation-ps-en-0-4-1.docker.gz
English to Pashto	0.4.1	https://data.statmt.org/gourmet/models/docker/translation-en-ps-0-4-1.docker.gz
Turkish to English	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-tr-en.tgz
English to Turkis	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-en-tr.tgz
Turkish to English	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-v2-tr-en.tgz
English to Turkish	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-v2-en-tr.tgz
Amharic to English	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-am-en.tgz
English to Amharic	0.1	http://data.statmt.org/gourmet/models/docker/mt-engine-en-am.tgz
Kyrgyz to English	0.1.1	https://data.statmt.org/gourmet/models/docker/translation-ky-en-0-1-1.docker.gz
English to Kyrgyz	0.1.1	https://data.statmt.org/gourmet/models/docker/translation-en-ky-0-1-1.docker.gz
Macedonian to English	0.1.1	https://data.statmt.org/gourmet/models/docker/translation-mk-en-0-1-1.docker.gz
English to Macedonian	0.1.2	https://data.statmt.org/gourmet/models/docker/translation-en-mk-0-1-2.docker.gz
Urdu to English	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-ur-en.v0.2.tar.gz
English to Urdu	0.2	http://data.statmt.org/gourmet/models/docker/mt-engine-en-ur.v0.2.tar.gz
Myanmar to English	0.1 Slower	http://data.statmt.org/gourmet/models/docker/translation-my-en-slower-0-1-0.docker.gz
English to Myanmar	0.1 Slower	http://data.statmt.org/gourmet/models/docker/translation-en-my-slower-0-1-0.docker.gz
Myanmar to English	0.1 Faster	http://data.statmt.org/gourmet/models/docker/translation-my-en-faster-0-1-0.docker.gz
English to Myanmar	0.2.3 Faster	https://data.statmt.org/gourmet/models/docker/translation-en-my-faster-0-2-3.docker.gz
Yoruba to English	0.1	https://data.statmt.org/gourmet/models/docker/mt-engine-yo-en.tgz
English to Yoruba	0.1	https://data.statmt.org/gourmet/models/docker/mt-engine-en-yo.tgz

Various Docker Modules
http://data.statmt.org/gourmet/models/docker/

Software

mBART pretraining of Marian models
https://github.com/transducens/smart-segmentation

Morphological segmentation using Apertium resources
https://github.com/transducens/smart-segmentation

Hierarchical decoding (word RNN and decoding of words character by character)
https://github.com/d-ataman/Char-NMT

Latent modelling of morphology in NMT
https://github.com/d-ataman/lmm

Neural n-to-m alignment models
https://github.com/Roxot/m-to-n-alignments

Deep latent variable models for language modelling
https://github.com/tom-pelsmaeker/deep-generative-lm

Interpretable text classifiers with sparse relaxations to discrete random variables
https://github.com/bastings/interpretable_predictions

Training data for document-level machine translation
https://github.com/radidd/Doc-substructure-NMT

Contrastive test sets for document-level machine translation
https://github.com/rbawden/Large-contrastive-pronoun-testset-EN-FR

Auto-encoding variational neural machine translation
https://github.com/Roxot/AEVNMT.pt

Bayesian data analysis of NMT models
https://github.com/probabll/bda-nmt

LASER train (language-agnostic sentence embeddings): It reproduces the architecture described by Artetxe and Schwenk (2018, 2019) to train language-agnostic sentence embeddings.
https://github.com/transducens/LASERtrain

LinguaCrawl: It is used to crawl top-level domains. It has been completely developed within the GoURMET project and is compatible with Bitextor, so the data crawled with it can be processed with Bitextor.
https://github.com/transducens/linguacrawl/

WMT19 Gujarati system models and scripts
http://data.statmt.org/wmt19_systems/

Tool for fusing, extending and using language representations
https://github.com/aoncevay/multiview-langrep

Code for the improving massively multilingual NMT work
https://github.com/bzhangGo/zero

Code for the language model prior work
https://github.com/cbaziotis/lm-prior-for-nmt

Code for the auto-encoding variational NMT work
https://github.com/Roxot/AEVNMT

Bitextor: It is used to identify, align and clean parallel data by crawling web domains specified by the user.
https://github.com/bitextor/bitextor

It is developed in tight coordination with the Paracrawl project. Most contributions related to GoURMET focus on the addition of components that allow to improve the performance of the tool for under-resourced languages.

Bicleaner: It is used to filter noisy segment pairs in parallel corpora.
https://github.com/bitextor/bicleaner

We have contributed to the master branch and we are also working on another branch (https://github.com/bitextor/bicleaner/tree/bicleaner-0.14-NAACL20) that will eventually be merged with the master one.

Constrained optimisation for deep generative models in torch
https://github.com/EelcovdW/pytorch-constrained-opt.git

Probability distributions for torch including sparse relaxations to discrete random variables
https://github.com/probabll/dists.pt

Probabilistic modules for torch
https://github.com/probabll/dgm.pt

BayerSeq. Software package that implements the variational autoencoder models of sentences that we used for data augmentation.
https://github.com/probabll/dgm.pt

mtl-da Scripts for training machine translation systems using different data augmentation techniques in the target language.
https://github.com/vitaka/mtl-da

Evaluation Tools

Direct Assessment – Sentence Pairs Evaluation Tool
https://github.com/bbc/gourmet-sentence-pairs-evaluation

The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine translated sentence to a human translated sentence where the human translation is assumed to be the gold standard

Gap Fill Evaluation Tool
https://github.com/bbc/gourmet-gap-fill-evaluation

The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a sentence that has been translated by a human using the machine translation of the same sentence as a guide to what words should go in that sentence.

Photo by Ankush Minda on Unsplash