Data & Model Releases

As part of GoURMET’s efforts to increase the resources and tools available for low-resource machine translation, we have released many of the corpora and software tools created during the project.

Corpora

English–Swahili parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip
English–Turkish parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-tr.zip
English–Amharic parallel corpus and Amharic monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-am.zip

English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ky.zip

Kyrgyz–Russian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ky-ru.zip
PMIndia – Parallel corpus of languages of India
http://data.statmt.org/pmindia/
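
The corpus archives listed above are plain zip files, so they can be inspected with standard tooling. The following is a minimal sketch using only the Python standard library. The URL is the English–Swahili one from the list; the helper names `fetch_bytes` and `list_members` are illustrative, and no assumption is made about the file names or layout inside the archive, which are not described on this page.

```python
import io
import urllib.request
import zipfile

# URL copied from the corpus list; the archive's internal layout is not documented here.
CORPUS_URL = "http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip"


def fetch_bytes(url: str) -> bytes:
    """Download a resource fully into memory (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_members(archive_bytes: bytes) -> list:
    """Return the names of the files stored in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.namelist()
```

Calling `list_members(fetch_bytes(CORPUS_URL))` would return the names of the files in the archive. For large corpora it may be preferable to save the download to disk first instead of holding it in memory.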

Software

Morphological segmentation using Apertium resources
https://github.com/transducens/smart-segmentation
LASER train (language-agnostic sentence embeddings)
https://github.com/transducens/LASERtrain
Top-level-domain crawler
https://github.com/transducens/linguacrawl/

Evaluation Tools

Direct Assessment Sentence Pairs Evaluation Tool
https://github.com/bbc/gourmet-sentence-pairs-evaluation

The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine-translated sentence with that of a human-translated sentence, where the human translation is assumed to be the gold standard.

Gap Fill Evaluation Tool
https://github.com/bbc/gourmet-gap-fill-evaluation

The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a human-translated sentence, using the machine translation of the same sentence as a guide to which words belong in the gaps.
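
To make the Gap Fill idea concrete, here is a minimal sketch of how one test item could be built and scored. It is an illustration under stated assumptions, not the implementation used in the tool linked above: the names `GapFillItem`, `make_gap_fill_item` and `score_gap_fill` are hypothetical, tokens are split on whitespace, the gap positions are chosen by the caller, and an answer counts as correct only if it matches the removed word exactly (ignoring case).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GapFillItem:
    gapped_sentence: str  # human translation with some words blanked out
    answers: list[str]    # the removed words, in sentence order
    mt_hint: str          # machine translation shown to the annotator as a guide


def make_gap_fill_item(human_translation: str, machine_translation: str,
                       gap_positions: set[int], blank: str = "____") -> GapFillItem:
    """Blank out the whitespace tokens at the given 0-based positions."""
    tokens = human_translation.split()
    answers = [tok for i, tok in enumerate(tokens) if i in gap_positions]
    gapped = [blank if i in gap_positions else tok for i, tok in enumerate(tokens)]
    return GapFillItem(" ".join(gapped), answers, machine_translation)


def score_gap_fill(item: GapFillItem, responses: list[str]) -> float:
    """Fraction of gaps filled with the removed word (case-insensitive exact match)."""
    if not item.answers:
        return 0.0
    correct = sum(r.strip().lower() == a.lower()
                  for r, a in zip(responses, item.answers))
    return correct / len(item.answers)
```

Under this sketch, a higher average score across annotators suggests that the machine translation carried enough information to recover the missing words. Exact matching is a simplifying assumption; a real evaluation might also accept synonyms.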
