Data, Model & Software Releases

As part of GoURMET’s efforts to increase the resources and tools available for low-resource machine translation, we have released many of the corpora, models and software created during the project. The corpora are also available at OPUS (http://opus.nlpl.eu/GoURMET.php), and the translation models, together with instructions on how to use them, are available on GitHub.

Table of Contents

    Corpora

    English–Swahili parallel corpus and Swahili monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-sw.zip

    English–Turkish parallel corpus and Turkish monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-tr.zip

    English–Amharic parallel corpus and Amharic monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-am.zip

    English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ky.zip

    Kyrgyz–Russian parallel corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ky-ru.zip

    English–Serbian parallel corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.sr-en.zip

    English–Serbo-Croatian parallel corpora
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.hbs-en.zip

    Parallel and monolingual corpora of languages of India
    http://data.statmt.org/pmindia/

    Monolingual News Crawl
    http://data.statmt.org/news-crawl

    English–Macedonian parallel corpus and Macedonian monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-mk.zip

    English–Yoruba parallel corpus and Yoruba monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-yo.zip

    English–Burmese parallel corpus and Burmese monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-my.zip

    English–Pashto parallel corpus and Pashto monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ps.zip

    English–Igbo parallel corpus and Igbo monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ig.zip

    English–Hausa parallel corpus and Hausa monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.en-ha.zip

    Tigrinya monolingual corpus
    http://data.statmt.org/gourmet/corpora/GoURMET-crawled.ti.zip

    OPUS-100 corpus
    (An English-centric multilingual corpus covering 100 languages sampled from OPUS. All training pairs include English on either the source or target side.)
    https://github.com/EdinburghNLP/opus-100-corpus
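The GoURMET-crawled links above share one URL pattern, so fetching a corpus can be scripted. The following is a minimal sketch rather than project-provided tooling: `corpus_url` and `fetch_corpus` are hypothetical helper names, the pair string must be written exactly as it appears in the link (for example `en-sw`, `sr-en` or `ti`), and nothing is assumed about the files inside each archive.

```python
import urllib.request
import zipfile
from pathlib import Path

# Common prefix of the GoURMET-crawled archive links listed above.
BASE_URL = "http://data.statmt.org/gourmet/corpora"


def corpus_url(pair: str) -> str:
    """Build the archive URL for a pair string such as 'en-sw'."""
    return f"{BASE_URL}/GoURMET-crawled.{pair}.zip"


def fetch_corpus(pair: str, dest: Path) -> list[str]:
    """Download one archive into dest, unpack it, and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"GoURMET-crawled.{pair}.zip"
    urllib.request.urlretrieve(corpus_url(pair), archive)
    with zipfile.ZipFile(archive) as zf:
        # Unpack into a per-pair subdirectory; the layout inside is not assumed.
        zf.extractall(dest / pair)
        return zf.namelist()
```

Calling `fetch_corpus("en-sw", Path("data"))` would download the English–Swahili archive and return the names of the files it contains, which is a convenient way to inspect an archive before processing it.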

    Translation Models

    The models listed below cover fifteen languages, each paired with English: Bulgarian, Gujarati, Swahili, Tamil, Serbian, Hausa, Igbo, Pashto, Turkish, Amharic, Kyrgyz, Macedonian, Urdu, Myanmar and Yoruba. Get the list here:
    http://data.statmt.org/gourmet/models/ and https://github.com/EdinburghNLP/gourmet-models

    Translation direction           Version         Download
    Bulgarian to English            0.1             http://data.statmt.org/gourmet/models/docker/bg-en.20190801.tgz
    English to Bulgarian            0.2             http://data.statmt.org/gourmet/models/docker/en-bg.v0.2.tgz
    Gujarati to English             0.1             http://data.statmt.org/gourmet/models/docker/gu-en.20190628.tgz
    English to Gujarati             0.2             http://data.statmt.org/gourmet/models/docker/en-gu.v0.2.tgz
    Swahili to English              0.5             https://data.statmt.org/gourmet/models/docker/translation-sw-en-0-5-0.docker.gz
    English to Swahili              0.5             https://data.statmt.org/gourmet/models/docker/translation-en-sw-0-5-0.docker.gz
    Tamil to English                0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-ta-en.tar.gz
    English to Tamil                0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-en-ta.tar.gz
    Serbian to English              0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-sr-en.tar.gz
    English to Serbian (Cyrillic)   0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-en-sr.cyr.tar.gz
    English to Serbian (Latin)      0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-en-sr.lat.tar.gz
    Hausa to English                0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-ha-en.v0.2.tar.gz
    English to Hausa                0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-en-ha.v0.2.tar.gz
    Igbo to English                 0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-ig-en.v0.2.tar.gz
    English to Igbo                 0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-en-ig.v0.2.tar.gz
    Pashto to English               0.4.1           https://data.statmt.org/gourmet/models/docker/translation-ps-en-0-4-1.docker.gz
    English to Pashto               0.4.1           https://data.statmt.org/gourmet/models/docker/translation-en-ps-0-4-1.docker.gz
    Turkish to English              0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-tr-en.tgz
    English to Turkish              0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-en-tr.tgz
    Turkish to English              0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-v2-tr-en.tgz
    English to Turkish              0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-v2-en-tr.tgz
    Amharic to English              0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-am-en.tgz
    English to Amharic              0.1             http://data.statmt.org/gourmet/models/docker/mt-engine-en-am.tgz
    Kyrgyz to English               0.1.1           https://data.statmt.org/gourmet/models/docker/translation-ky-en-0-1-1.docker.gz
    English to Kyrgyz               0.1.1           https://data.statmt.org/gourmet/models/docker/translation-en-ky-0-1-1.docker.gz
    Macedonian to English           0.1.1           https://data.statmt.org/gourmet/models/docker/translation-mk-en-0-1-1.docker.gz
    English to Macedonian           0.1.2           https://data.statmt.org/gourmet/models/docker/translation-en-mk-0-1-2.docker.gz
    Urdu to English                 0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-ur-en.v0.2.tar.gz
    English to Urdu                 0.2             http://data.statmt.org/gourmet/models/docker/mt-engine-en-ur.v0.2.tar.gz
    Myanmar to English              0.1 (slower)    http://data.statmt.org/gourmet/models/docker/translation-my-en-slower-0-1-0.docker.gz
    English to Myanmar              0.1 (slower)    http://data.statmt.org/gourmet/models/docker/translation-en-my-slower-0-1-0.docker.gz
    Myanmar to English              0.1 (faster)    http://data.statmt.org/gourmet/models/docker/translation-my-en-faster-0-1-0.docker.gz
    English to Myanmar              0.2.3 (faster)  https://data.statmt.org/gourmet/models/docker/translation-en-my-faster-0-2-3.docker.gz
    Yoruba to English               0.1             https://data.statmt.org/gourmet/models/docker/mt-engine-yo-en.tgz
    English to Yoruba               0.1             https://data.statmt.org/gourmet/models/docker/mt-engine-en-yo.tgz

    Various Docker Modules
    http://data.statmt.org/gourmet/models/docker/

    Software

    mBART pretraining of Marian models
    https://github.com/transducens/smart-segmentation

    Morphological segmentation using Apertium resources
    https://github.com/transducens/smart-segmentation

    Hierarchical decoding (word RNN and decoding of words character by character)
    https://github.com/d-ataman/Char-NMT

    Latent modelling of morphology in NMT
    https://github.com/d-ataman/lmm

    Neural n-to-m alignment models
    https://github.com/Roxot/m-to-n-alignments

    Deep latent variable models for language modelling
    https://github.com/tom-pelsmaeker/deep-generative-lm

    Interpretable text classifiers with sparse relaxations to discrete random variables
    https://github.com/bastings/interpretable_predictions

    Training data for document-level machine translation
    https://github.com/radidd/Doc-substructure-NMT

    Contrastive test sets for document-level machine translation
    https://github.com/rbawden/Large-contrastive-pronoun-testset-EN-FR

    Auto-encoding variational neural machine translation
    https://github.com/Roxot/AEVNMT.pt

    Bayesian data analysis of NMT models
    https://github.com/probabll/bda-nmt

    LASERtrain: a tool for training language-agnostic sentence embeddings. It reproduces the architecture described by Artetxe and Schwenk (2018, 2019).
    https://github.com/transducens/LASERtrain

    LinguaCrawl: a crawler for top-level domains. It was developed entirely within the GoURMET project and is compatible with Bitextor, so the data it crawls can be processed with Bitextor.
    https://github.com/transducens/linguacrawl/

    WMT19 Gujarati system models and scripts
    http://data.statmt.org/wmt19_systems/

    Tool for fusing, extending and using language representations
    https://github.com/aoncevay/multiview-langrep

    Code for the improving massively multilingual NMT work
    https://github.com/bzhangGo/zero

    Code for the language model prior work
    https://github.com/cbaziotis/lm-prior-for-nmt

    Code for the auto-encoding variational NMT work
    https://github.com/Roxot/AEVNMT

    Bitextor: It is used to identify, align and clean parallel data by crawling web domains specified by the user.
    https://github.com/bitextor/bitextor

    It is developed in close coordination with the ParaCrawl project. Most GoURMET contributions add components that improve the tool's performance for under-resourced languages.

    Bicleaner: It is used to filter noisy segment pairs in parallel corpora.
    https://github.com/bitextor/bicleaner

    We have contributed to the master branch and are also working on a separate branch (https://github.com/bitextor/bicleaner/tree/bicleaner-0.14-NAACL20) that will eventually be merged into master.
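To illustrate what filtering noisy segment pairs amounts to in practice, here is a minimal sketch of a threshold filter. It assumes a tab-separated input in which each line holds a source segment, a target segment and, in the last column, a quality score. That layout, the function name `filter_scored_pairs` and the 0.5 threshold are illustrative assumptions, not Bicleaner's documented interface; consult the Bicleaner repository for the actual output format and recommended thresholds.

```python
from typing import Iterable, Iterator


def filter_scored_pairs(
    lines: Iterable[str], threshold: float = 0.5
) -> Iterator[tuple[str, str]]:
    """Yield (source, target) pairs whose score is at least the threshold."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line: source, target or score is missing
        try:
            score = float(fields[-1])  # assumed: the score is the last column
        except ValueError:
            continue  # the last column is not a number, skip the line
        if score >= threshold:
            yield fields[0], fields[1]
```

Because the function works on any iterable of lines, it can be applied to an open file handle and will stream through large corpora without loading them into memory.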

    Constrained optimisation for deep generative models in torch
    https://github.com/EelcovdW/pytorch-constrained-opt.git

    Probability distributions for torch including sparse relaxations to discrete random variables
    https://github.com/probabll/dists.pt

    Probabilistic modules for torch
    https://github.com/probabll/dgm.pt

    BayerSeq: a software package implementing the variational autoencoder models of sentences that we used for data augmentation.
    https://github.com/probabll/dgm.pt

    mtl-da: scripts for training machine translation systems using different data augmentation techniques in the target language.
    https://github.com/vitaka/mtl-da

    Evaluation Tools

    Direct Assessment Sentence Pairs Evaluation Tool
    https://github.com/bbc/gourmet-sentence-pairs-evaluation

    The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine-translated sentence with that of a human-translated sentence, where the human translation is assumed to be the gold standard.
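Once such judgements have been collected, they are typically summarised per system. The sketch below shows one simple way to do that and is not taken from the evaluation tool itself: the function name `mean_scores`, the use of a numeric rating per judgement and the system labels are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(ratings: list[tuple[str, float]]) -> dict[str, float]:
    """Average the human ratings collected for each system.

    Each rating is a (system label, numeric score) pair; the score scale
    is whatever the annotators were asked to use (an assumption here).
    """
    by_system: dict[str, list[float]] = defaultdict(list)
    for system, score in ratings:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}
```

A plain mean is the simplest summary; a real evaluation would usually also account for differences between annotators before comparing systems.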

    Gap Fill Evaluation Tool
    https://github.com/bbc/gourmet-gap-fill-evaluation

    The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a human-translated sentence, using the machine translation of the same sentence as a guide to which words belong in the gaps.
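The idea can be sketched in a few lines: blank out some words of the human translation, show the result alongside the machine translation, and count how many gaps the evaluator restores correctly. This is an illustration of the general technique under simple assumptions (whitespace tokenisation, randomly chosen gaps, case-insensitive exact matching), not the implementation used by the tool; `make_gaps` and `score_fills` are hypothetical names.

```python
import random

GAP = "____"  # placeholder shown to the evaluator in place of a removed word


def make_gaps(reference: str, n_gaps: int, seed: int = 0) -> tuple[str, dict[int, str]]:
    """Blank out n_gaps words of the human translation.

    Returns the gapped sentence and a mapping from word position to the
    word that was removed there.
    """
    words = reference.split()
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(words)), min(n_gaps, len(words))))
    answers = {i: words[i] for i in positions}
    gapped = " ".join(GAP if i in answers else w for i, w in enumerate(words))
    return gapped, answers


def score_fills(answers: dict[int, str], fills: dict[int, str]) -> float:
    """Fraction of gaps whose fill matches the removed word, ignoring case."""
    if not answers:
        return 0.0
    correct = sum(
        fills.get(i, "").strip().lower() == word.lower()
        for i, word in answers.items()
    )
    return correct / len(answers)
```

The intuition is that a better machine translation gives the evaluator more of the information needed to recover the missing words, so a higher fraction of correct fills suggests a more useful translation.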
