Prototypes and Technologies
The GoURMET project will develop methods for extracting data and training reliable, robust machine translation systems for languages and domains with little translated training material. To do this, we will pursue four complementary research directions and integrate our solutions into our user partners' workflows.
Four main areas of work
Data Gathering and Augmentation: A crucial part of the project will be gathering the existing resources available for the languages identified as project priorities, and being able to do this quickly and easily for new languages. We will build on existing tools for extracting high-quality corpora and language resources from large Internet crawls and improve their ability to handle low-resource languages. We will also use the latest deep learning techniques to augment existing data resources, expanding their coverage of words and morphological variants.
Modelling Morphological Structure: To learn from very little training data, it is essential to investigate how best to represent the basic primitives of our model. We will look at how to model words and their morphological structure within a neural machine translation model by inducing linguistically plausible segmentations, leveraging word alignments to learn morphology across multiple languages, and integrating latent features.
Structure Induction at Sentence Level: We will induce and exploit sentence-level structure in neural machine translation models to make the models easier to learn. We will rely on latent alignments and graph-structured models to learn reusable patterns from plain-text corpora.
Transfer Learning: To deploy translation successfully in the low-resource setting, we need to leverage all available related data sources, such as monolingual data, dictionaries, and translations in other languages. We will do this by developing new techniques for transferring knowledge from related tasks such as language modelling, word prediction, and translation of related languages.