Data, Model & Software Releases
As part of GoURMET’s efforts to increase the resources and tools available for low-resource machine translation, we have released many of the corpora, models and software created during the project. The corpora are also available at OPUS (http://opus.nlpl.eu/GoURMET.php), and the translation models (including instructions on how to use them) are available on GitHub.
English–Swahili parallel corpus
English–Turkish parallel corpus
English–Amharic parallel corpus and Amharic monolingual corpus
English–Kyrgyz parallel corpus and Kyrgyz monolingual corpus
Kyrgyz–Russian parallel corpus
PMIndia – Parallel corpus of languages of India
English–Serbian parallel corpus
English–Serbo-Croatian parallel corpus
Monolingual News Crawl
(An English-centric multilingual corpus covering 100 languages sampled from OPUS. All training pairs include English on either the source or target side.)
The repository currently covers seven languages: Bulgarian, Turkish, Swahili, Kyrgyz, Serbian, Amharic and Gujarati.
Various Docker Modules
mBART pretraining of Marian models
Morphological segmentation using Apertium resources
Hierarchical decoding (word RNN and decoding of words character by character)
Latent modelling of morphology in NMT
Neural n-to-m alignment models
Deep latent variable models for language modelling
Interpretable text classifiers with sparse relaxations to discrete random variables
Training data for document-level machine translation
Contrastive test sets for document-level machine translation
Auto-encoding variational neural machine translation
Bayesian data analysis of NMT models
LASER train (language-agnostic sentence embeddings): It reproduces the architecture described by Artetxe and Schwenk (2018, 2019) to train language-agnostic sentence embeddings.
LinguaCrawl: It is used to crawl top-level domains. It was developed entirely within the GoURMET project and is compatible with Bitextor, so data crawled with it can be processed with Bitextor.
WMT19 Gujarati system models and scripts
Tool for fusing, extending and using language representations
Code for the improving massively multilingual NMT work
Code for the language model prior work
Code for the auto-encoding variational NMT work
Bitextor: It is used to identify, align and clean parallel data by crawling web domains specified by the user.
It is developed in close coordination with the ParaCrawl project. Most GoURMET contributions focus on adding components that improve the tool's performance for under-resourced languages.
Bicleaner: It is used to filter noisy segment pairs in parallel corpora.
We have contributed to the master branch and are also working on a separate branch (https://github.com/bitextor/bicleaner/tree/bicleaner-0.14-NAACL20) that will eventually be merged into master.
Constrained optimisation for deep generative models in torch
Probability distributions for torch including sparse relaxations to discrete random variables
Probabilistic modules for torch
BayerSeq: A software package that implements the variational autoencoder models of sentences that we used for data augmentation.
mtl-da: Scripts for training machine translation systems using different data augmentation techniques in the target language.
Direct Assessment – Sentence Pairs Evaluation Tool
The goal of Direct Assessment is to evaluate a translation model by asking a human to compare the quality of a machine-translated sentence with that of a human-translated sentence, where the human translation is assumed to be the gold standard.
Gap Fill Evaluation Tool
The goal of Gap Fill is to evaluate a translation model by asking a human to fill in the gaps in a human-translated sentence, using the machine translation of the same sentence as a guide to which words belong in the gaps.
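The Gap Fill idea can be illustrated with a minimal Python sketch. It is not the code of the released tool: the function names `make_gap_fill_item` and `score_gap_fill`, the random choice of gap positions, and the exact-match scoring are all assumptions made for illustration, since this page does not describe how the tool selects gaps or scores answers.

```python
import random

GAP = "_____"  # placeholder shown to the evaluator (an assumption)


def make_gap_fill_item(reference, n_gaps, seed=0):
    """Blank out n_gaps randomly chosen words of a human reference translation.

    Returns the gapped sentence and the list of removed words, in order.
    Random gap selection is an illustrative assumption.
    """
    tokens = reference.split()
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(tokens)), n_gaps))
    answers = [tokens[i] for i in positions]
    gapped = [GAP if i in positions else tok for i, tok in enumerate(tokens)]
    return " ".join(gapped), answers


def score_gap_fill(answers, responses):
    """Fraction of gaps filled with the expected word (case-insensitive exact match)."""
    if not answers:
        return 0.0
    correct = sum(a.lower() == r.strip().lower() for a, r in zip(answers, responses))
    return correct / len(answers)


# Hypothetical example sentence, not taken from any GoURMET corpus.
reference = "The minister announced new funding for rural schools"
gapped, answers = make_gap_fill_item(reference, n_gaps=2, seed=1)
print(gapped)                            # reference sentence with two gaps
print(score_gap_fill(answers, answers))  # a perfect response scores 1.0
```

In an evaluation, the evaluator would see the gapped sentence alongside the machine translation of the same source sentence. A higher score then suggests that the machine translation conveyed enough of the content to recover the missing words.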