Data & Model Releases

As part of GoURMET’s efforts to increase the resources and tools available for low-resource machine translation, we have released many of the corpora and software tools created during the project.

Corpora

English–Swahili parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip
English–Turkish parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-tr.zip
English–Amharic parallel corpus and Amharic monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-am.zip
English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ky.zip
Kyrgyz–Russian parallel corpus
http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ky-ru.zip
PMIndia – Parallel corpus of languages of India
http://data.statmt.org/pmindia/
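
The GoURMET-crawled archives listed above share a common URL pattern, so fetching one can be sketched in a few lines of Python. This is a minimal illustration, not an official download tool: the helper names (corpus_url, download_corpus, list_archive) are hypothetical, and because the file layout inside each archive is not described here, the sketch only lists the member names rather than assuming any particular format.

```python
import zipfile
from urllib.request import urlretrieve

# Base location of the GoURMET-crawled archives, taken from the links above.
BASE_URL = "http://data.statmt.org/gourmet/corpora"


def corpus_url(src, tgt):
    """Build the archive URL for a language pair, e.g. ("en", "sw")."""
    return f"{BASE_URL}/GoURMET-crawled.{src}-{tgt}.zip"


def download_corpus(src, tgt, dest=None):
    """Download the archive for a language pair and return the local path."""
    url = corpus_url(src, tgt)
    # Default to the archive's own file name in the current directory.
    dest = dest or url.rsplit("/", 1)[-1]
    path, _headers = urlretrieve(url, dest)
    return path


def list_archive(path_or_file):
    """Return the member names of a zip archive without extracting it."""
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.namelist()


# Example (requires network access):
#   path = download_corpus("en", "sw")
#   print(list_archive(path))
```

Only the pairs listed above (en-sw, en-tr, en-am, en-ky, ky-ru) are known to exist under this pattern; PMIndia is hosted at its own address and is not covered by this sketch.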

Software

Morphological segmentation using Apertium resources
https://github.com/transducens/smart-segmentation
LASER train (language-agnostic sentence embeddings)
https://github.com/transducens/LASERtrain
Top-level-domain crawler
https://github.com/transducens/linguacrawl/