UPDATE: 7/6 Additional details on the evaluation server are up! Please submit early.
UPDATE: 6/4 FLORES 101 and devtest data are up!
One of the most exciting recent trends in NLP is training a single system on multiple languages at once. In particular, a multilingual machine translation system may be capable of translating a sentence into several target languages, of translating sentences from several source languages into a given language, or any combination thereof.
This is a powerful paradigm for two reasons. First, from a practical perspective, it greatly simplifies system development and deployment, as only a single model needs to be built and used for all language pairs, as opposed to one for each language pair. Second, it has the potential to improve translation quality on low-resource language pairs, since a single multilingual machine translation system can transfer knowledge from similar but higher-resource language pairs, and from data in similar domains but in different languages.
However, to date, evaluation of multilingual machine translation systems has been hindered by the lack of high-quality evaluation benchmarks and of a standardized evaluation process.
The goal of this task is to bring the community together on the topic of low-resource multilingual machine translation for the first time, in the hope of fostering progress in this exciting direction. We do so by introducing a realistic benchmark as well as a fair and rigorous evaluation process, as described below.
We are going to have three tracks: two small tasks and a large task.
The small tracks evaluate translation between fairly related languages and English (all pairs). The large track uses 101 languages.
The small tracks are examples of a practical MMT problem on similar languages, and they do not require very large computational resources at training time, especially given the pretrained models we provide.
At the other end of the spectrum, the large track explores the ambitious goal of translating among a very large number of languages all at once, which may require a substantial amount of compute at training time.
Small Track #1 : 5 Central/East European languages, 30 directions: Croatian, Hungarian, Estonian, Serbian, Macedonian, English
Small Track #2: 5 South East Asian languages, 30 directions: Javanese, Indonesian, Malay, Tagalog, Tamil, English
(note: Sundanese was originally part of this track but was not included in the final release of FLORES due to quality issues)
Large Track: All Languages, to and from English. Full list at the bottom of this page.
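As a sanity check on the direction counts above: translating among N languages in all directions yields N × (N − 1) ordered pairs, so 5 languages plus English give 30 directions. A minimal sketch (the language codes are illustrative, matching Small Track #1):

```python
from itertools import permutations

# Small Track #1: 5 Central/East European languages plus English.
track1 = ["hr", "hu", "et", "sr", "mk", "en"]

# All ordered (source, target) pairs with source != target.
directions = list(permutations(track1, 2))

print(len(directions))  # 6 * 5 = 30 directions
```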
Constrained The small tracks are constrained, i.e. only the data available on this page is allowed. Pre-trained models may be used as long as they are publicly available for download.
Unconstrained The large track is fully unconstrained on model and training data.
We want to continue encouraging the research community to work on low-resource translation. As part of this, we encourage participants to apply for compute grants so that GPU compute is less of a barrier for translation research. You can see more detailed information and apply for the compute grants here.
The training data is provided by the publicly available Opus repository, which contains data of varying quality from a variety of domains. We also provide in-domain Wikipedia monolingual data for each language. The small tracks are fully constrained, so only the data that is provided can be used; this enables fairer comparison across methods. Check the multilingual data page for a detailed view of the resources.
The validation and test data are obtained from the FLORES 101 evaluation benchmark, which will be made available in June 2021. This is a high-quality evaluation benchmark that enables evaluation of MMT systems in more than a hundred languages. It supports many-to-many evaluation, as all sentences are aligned across all languages. In particular, we will provide validation (dev) and validation-test (devtest) datasets to aid the development of systems. The actual evaluation will be performed on a dedicated evaluation server, where participants will upload their evaluation code.
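To illustrate what "all sentences are aligned across all languages" enables, here is a small sketch (the corpus below is toy placeholder data, not real FLORES sentences): a single multi-way aligned corpus yields source/reference pairs for any translation direction, with no per-pair test sets needed.

```python
# Toy multi-way aligned corpus: one sentence list per language,
# aligned by index. These strings are invented placeholders.
corpus = {
    "en": ["Hello.", "How are you?"],
    "hu": ["Helló.", "Hogy vagy?"],
    "et": ["Tere.", "Kuidas läheb?"],
}

def eval_pairs(corpus, src, tgt):
    """Return (source_sentence, reference) pairs for any direction."""
    return list(zip(corpus[src], corpus[tgt]))

# The same corpus serves every direction, e.g. Hungarian -> Estonian:
pairs = eval_pairs(corpus, "hu", "et")
print(pairs[0])  # ('Helló.', 'Tere.')
```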
Training data: available here.
- Parallel data from Opus
- Monolingual data from Wikipedia
| Model | Num layers | Embed dimension | FFN dimension | Vocab Size | #params | Download |
|-------|------------|-----------------|---------------|------------|---------|----------|
These models are trained similarly to M2M-100, with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages can be found at the bottom of this page.
For more information, see the fairseq page.
There will be two submission periods.
The first submission period, which lasts until August 8, 2021, will let participants submit their code to an evaluation server in order to evaluate on a hidden test set. We strongly recommend that you evaluate your model early, as you might encounter specific issues while writing your handler. For support with the submission process, please open a GitHub issue and email firstname.lastname@example.org.
The second and final submission period, from August 9 to August 13, will let participants submit their code only once, to evaluate on a different hidden test set (from the same domain as the first hidden test set and as the provided dev and devtest datasets).