Large-Scale Multilingual Machine Translation - ACL 2021 Sixth Conference on Machine Translation

Shared Task: Large-Scale Multilingual Machine Translation

UPDATE: 7/6 Additional details on the evaluation server are up! Please submit early.

UPDATE: 6/4 FLORES 101 and devtest data are up!

Overview

One of the most exciting recent trends in NLP is training a single system on multiple languages at once. In particular, a multilingual machine translation system may be able to translate a sentence into several languages, translate sentences from several languages into a given language, or handle any combination thereof.

This is a powerful paradigm for two reasons. First, from a practical perspective, it greatly simplifies system development and deployment, as only a single model needs to be built and used for all language pairs, as opposed to one for each language pair. Second, it has the potential to improve translation quality on low-resource language pairs, because a single multilingual system can transfer knowledge from similar but higher-resource language pairs and from data in similar domains but different languages.

However, to date, evaluation of multilingual machine translation systems has been hindered by the lack of high-quality evaluation benchmarks and the lack of a standardized evaluation process.

Goals

The goal of this task is to bring the community together on the topic of low-resource multilingual machine translation for the first time, in the hope of fostering progress in this exciting direction. We do so by introducing a realistic benchmark as well as a fair and rigorous evaluation process, as described below.

Task Description

We are going to have three tracks: two small tracks and one large track.
The small tracks evaluate translation between fairly related languages and English (all pairs). The large track uses 101 languages.

The small tracks are an example of a practical multilingual machine translation (MMT) problem in similar languages. They do not require very large computational resources at training time, particularly given the pretrained models we provide.

At the other end of the spectrum, the large track explores the very ambitious goal of translating between a very large number of languages at once, which may require a substantial amount of compute at training time.

Track Details

Small Track #1: 5 Central/East European languages, 30 directions: Croatian, Hungarian, Estonian, Serbian, Macedonian, English
Small Track #2: 5 South East Asian languages, 30 directions: ~~Sundanese,~~ Javanese, Indonesian, Malay, Tagalog, Tamil, English (note: Sundanese was not included in the final release of FLORES due to quality issues)
Large Track: All Languages, to and from English. Full list at the bottom of this page.
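
The direction counts for the small tracks follow from counting ordered source-target pairs: each small track has five languages plus English, and 6 × 5 = 30. A minimal sketch of that arithmetic for Small Track #1 (the variable and function names are illustrative and not part of the task tooling):

```python
from itertools import permutations

# Languages of Small Track #1 as listed above, including English.
SMALL_TRACK_1 = [
    "Croatian", "Hungarian", "Estonian", "Serbian", "Macedonian", "English",
]

def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))

directions = translation_directions(SMALL_TRACK_1)
print(len(directions))  # 30
```

The same function applied to the six languages of Small Track #2 gives 30 directions as well.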

Allowed resources

Constrained: The small tracks are constrained, i.e. only the data available on this page is allowed. Pre-trained models are OK to use as long as they are publicly available for download.
Unconstrained: The large track is fully unconstrained on model and training data.

Compute grants

We want to continue encouraging the research community to work on low-resource translation. As part of this, we encourage participants to apply for compute grants so that GPU compute is less of a barrier for translation research. You can see more detailed information and apply for the compute grants here.

Data

The training data is provided by the publicly available Opus repository, which contains data of various quality from a variety of domains. We also provide in-domain Wikipedia monolingual data for each language. The small tracks are fully constrained, so only the data that is provided can be used in them. This enables fairer comparison across methods. Check the multilingual data page for a detailed view of the resources.

The validation and test data are obtained from the Flores 101 evaluation benchmark, which will be made available in June 2021.
This is a high-quality evaluation benchmark that enables evaluation of MMT systems in more than a hundred languages. It supports many-to-many evaluation, as all sentences are aligned across all languages. In particular, we provide validation (dev) and validation-test (devtest) datasets to aid the development of systems. The actual evaluation will be performed on a dedicated evaluation server, where participants will upload their evaluation code.
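
Because position i holds the same sentence in every language, any two languages form an evaluation pair simply by index. A small sketch with placeholder strings (the dictionary layout and the helper name `aligned_pairs` are assumptions for illustration, not the FLORES file format or API):

```python
# Toy stand-in for an aligned benchmark: one list per language, all of the
# same length, where position i is the same sentence in each language.
# The strings are placeholders, not FLORES data.
benchmark = {
    "English": ["en sentence 0", "en sentence 1"],
    "Indonesian": ["id sentence 0", "id sentence 1"],
    "Malay": ["ms sentence 0", "ms sentence 1"],
}

def aligned_pairs(data, source, target):
    """Pair up sentences of two languages by position."""
    src, tgt = data[source], data[target]
    if len(src) != len(tgt):
        raise ValueError("aligned data must have the same number of sentences")
    return list(zip(src, tgt))

# Any direction can be evaluated, including ones that do not involve English.
pairs = aligned_pairs(benchmark, "Indonesian", "Malay")
```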

Training data: available here. Parallel data from Opus; monolingual data from Wikipedia.

Evaluation data: Flores 101, dev and devtest, available here.

Pre-trained models: Pretrained models are available!

Model	Num layers	Embed dimension	FFN dimension	Vocab Size	#params	Download
`flores101_mm100_615M`	12	1024	4096	256,000	615M	https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
`flores101_mm100_175M`	6	512	2048	256,000	175M	https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz

These models are trained similarly to M2M-100, with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages can be found at the bottom of this page.

For more information, see the fairseq page.
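
The #params column can be roughly reproduced from the other columns of the table. The sketch below assumes a standard Transformer encoder-decoder in which the encoder and the decoder each have "Num layers" layers and share a single embedding matrix, and it ignores biases and layer-norm weights. These are assumptions made for the estimate, not details confirmed on this page.

```python
def approx_transformer_params(num_layers, embed_dim, ffn_dim, vocab_size):
    """Rough weight count for an encoder-decoder Transformer.

    Assumes shared embeddings, num_layers encoder layers and num_layers
    decoder layers, and ignores biases and layer-norm parameters.
    """
    embedding = vocab_size * embed_dim
    attention = 4 * embed_dim * embed_dim   # Q, K, V and output projections
    ffn = 2 * embed_dim * ffn_dim           # two feed-forward matrices
    encoder_layer = attention + ffn         # self-attention + FFN
    decoder_layer = 2 * attention + ffn     # self- and cross-attention + FFN
    return embedding + num_layers * (encoder_layer + decoder_layer)

# Configurations copied from the table above.
models = {
    "flores101_mm100_615M": dict(num_layers=12, embed_dim=1024, ffn_dim=4096, vocab_size=256_000),
    "flores101_mm100_175M": dict(num_layers=6, embed_dim=512, ffn_dim=2048, vocab_size=256_000),
}

estimates = {name: approx_transformer_params(**cfg) for name, cfg in models.items()}
for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.0f}M parameters")
```

Under these assumptions the estimates come out at about 614M and 175M, close to the listed sizes. In both models the 256,000-entry embedding matrix is a large share of the total.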

Evaluation

We'll be using the SentencePiece BLEU (spBLEU) variant for evaluation. All scripts and instructions are available in the FLORES repository.
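
For intuition, spBLEU is BLEU computed over SentencePiece pieces rather than words. The following sketch implements plain corpus-level BLEU (up to 4-grams, single reference, brevity penalty, no smoothing) over whatever tokens a `tokenize` function returns. Whitespace splitting stands in for the SentencePiece model here, so this is not the official scoring script and its numbers are not comparable to leaderboard scores.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, tokenize, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = tokenize(hyp), tokenize(ref)
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

# Stand-in tokenizer (assumption): the real metric uses SentencePiece pieces.
score = corpus_bleu(
    ["the cat sat on the mat today"],
    ["the cat sat on the mat"],
    tokenize=str.split,
)
```

For real evaluation, use the scripts in the FLORES repository so that tokenization matches the leaderboard.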

Submission

Submission to the leaderboard is available! See this guide for reference on how to do it.

There will be two submission periods.

The first submission period, which lasts until August 8, 2021, will let participants submit their code to an evaluation server in order to evaluate on a hidden test set. We strongly recommend that you evaluate your model early, as you might encounter specific issues while writing your handler. For support with the submission process, please open a GitHub issue and email flores@fb.com.
The second and final submission period, from August 9 to August 13, will let participants submit their code only once, to evaluate on a different hidden test set (from the same domain as the first hidden test set and the provided dev and devtest datasets).

Participants will be required to submit code that fits within the memory and compute limits of a p2.xlarge AWS instance; these limits are strictly enforced.

Contact flores@fb.com

Leaderboard

Important dates

~~Release of training data: April 2021~~
~~Release of dev and dev-test data: June 4 2021 (prior to that we encourage participants to use a portion of the training set for validation purposes)~~
Evaluation server opening for submissions: June 2021
Evaluation on final test set: August 9-13, 2021
Notification of results: August ~~15~~ 16, 2021
Draft of system papers: August 31, 2021
Reviews due: September ~~6~~ 7, 2021
Camera-ready version of system papers: September 15, 2021

Full language list

Afrikaans
Amharic
Arabic
Armenian
Assamese
Asturian
Azerbaijani
Belarusian
Bengali
Bosnian
Bulgarian
Burmese
Catalan
Cebuano
Chinese (Simplified)
Chinese (Traditional)
Croatian
Czech
Danish

Dutch
English
Estonian
Filipino (Tagalog)
Finnish
French
Fula
Galician
Ganda
Georgian
German
Greek
Gujarati
Hausa
Hebrew
Hindi
Hungarian
Icelandic
Igbo

Indonesian
Irish
Italian
Japanese
Javanese
Kabuverdianu
Kamba
Kannada
Kazakh
Khmer
Korean
Kyrgyz
Lao
Latvian
Lingala
Lithuanian
Luo
Luxembourgish
Macedonian

Malay
Malayalam
Maltese
Maori
Marathi
Mongolian
Nepali
Northern Sotho
Norwegian
Nyanja
Occitan
Oriya
Oromo
Pashto
Persian
Polish
Portuguese
Punjabi
Romanian

Russian
Serbian
Shona
Sindhi
Slovak
Slovenian
Somali
Sorani Kurdish
Spanish
Swahili
Swedish
Tajik
Tamil
Telugu
Thai
Turkish
Ukrainian
Umbundu
Urdu

Uzbek
Vietnamese
Welsh
Wolof
Xhosa
Yoruba
Zulu