Shared Task: Large-Scale Machine Translation Evaluation for African Languages
Overview
Machine translation research has traditionally placed an outsized focus on a limited number of languages, mostly belonging to the Indo-European family. Progress for many languages, some with millions of speakers, has been held back by data scarcity. An inspiring recent trend has been the increased attention paid to low-resource languages. However, these modelling efforts have been hindered by the lack of high-quality, standardised evaluation benchmarks.
For the second edition of the Large-Scale MT shared task, we aim to bring the community together around machine translation for a set of 24 African languages. We do so by introducing a high-quality benchmark, paired with a fair and rigorous evaluation procedure.
Task Description
The shared task will consist of three tracks.
- The Data track focuses on the contribution of novel corpora. Participants may submit
monolingual, bilingual or multilingual datasets relevant to the training of MT models for this year’s set of
languages. Further information on this track is available below.
- Two translation tracks will evaluate the performance of translation models covering all of this year’s languages. Translation will be evaluated to and from English and French, as well as between select African languages within particular geographical/cultural clusters:
- In the Constrained Translation track, only the data listed on this page will be allowed, including submissions accepted to the Data track. The use of open-source pre-trained models will be permitted, provided that they are published before the Data track submission deadline.
- In the Unconstrained Translation track, no restrictions will be placed on the use of data or models.
Resources
- To facilitate work in the Data track, we have released LASER sentence encoders supporting all the relevant languages. LASER is a sentence representation toolkit which enables the fast mining of parallel corpora; a minimal mining sketch follows this list. The encoders may be obtained here.
- The validation and test data are based on FLORES-101, a high-quality benchmark which supports many-to-many evaluation in over a hundred languages. We will be publishing supplements to FLORES-101 which will complete its coverage of this task’s set of languages. Validation and validation-test datasets will be provided, while the actual evaluation will be performed on a dedicated server to which participants will upload their models.
- For all languages, we will be providing parallel corpora mined from crawled data. Additional training data,
covering a subset of languages, are available from the
public OPUS repository.
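As an illustration of how sentence encoders of this kind can be used for mining, the sketch below scores candidate sentence pairs by cosine similarity in LASER’s shared multilingual embedding space. It uses the third-party laserembeddings package rather than the encoders released for this task (whose interface may differ), and the sentences, language codes, and similarity threshold are all illustrative assumptions, not part of the official pipeline.

# A minimal bitext-mining sketch over LASER's shared multilingual space.
# Assumes the third-party `laserembeddings` package (pip install laserembeddings,
# then `python -m laserembeddings download-models`). Toy data throughout.
import numpy as np
from laserembeddings import Laser

laser = Laser()

# Toy monolingual sentences; real mining would use crawled corpora.
eng_sents = ["The weather is nice today.", "Schools will reopen on Monday."]
swh_sents = ["Shule zitafunguliwa Jumatatu.", "Hali ya hewa ni nzuri leo."]

# Embed each side; LASER maps all languages into one vector space.
eng_emb = laser.embed_sentences(eng_sents, lang="en")
swh_emb = laser.embed_sentences(swh_sents, lang="sw")

# L2-normalise so that dot products are cosine similarities.
eng_emb /= np.linalg.norm(eng_emb, axis=1, keepdims=True)
swh_emb /= np.linalg.norm(swh_emb, axis=1, keepdims=True)
sims = eng_emb @ swh_emb.T

# Keep each English sentence's best match above a hypothetical threshold.
THRESHOLD = 0.8
for i, row in enumerate(sims):
    j = int(row.argmax())
    if row[j] >= THRESHOLD:
        print(f"{eng_sents[i]}\t{swh_sents[j]}\t{row[j]:.3f}")

Production mining pipelines typically replace the raw cosine threshold with margin-based scoring over nearest neighbours, but the underlying principle, comparable sentence vectors across languages, is the same.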
Full list of languages
Focus languages:
Afrikaans - afr | Amharic - amh | Chichewa - nya | Hausa - hau | Igbo - ibo | Kamba - kam | Kinyarwanda - kin | Lingala - lin | Luganda - lug | Luo - luo | Nigerian Fulfulde - fuv | Northern Sotho - nso | Oromo - orm | Shona - sna | Somali - som | Swahili - swh | Swati - ssw | Tswana - tsn | Umbundu - umb | Wolof - wol | Xhosa - xho | Xitsonga - tso | Yoruba - yor | Zulu - zul
Colonial linguae francae: English - eng, French - fra
Evaluation
Due to computational and budgetary constraints, human evaluation will be conducted on a small set of
language pairs. Specifically, we will evaluate the following 100 language pairs:
- Translation between the focus languages and the pivots [48 pairs]:
- to/from English: Afrikaans, Amharic, Chichewa, Nigerian Fulfulde, Hausa, Igbo, Kamba, Kinyarwanda,
Luganda, Luo, Northern Sotho, Oromo, Shona, Somali, Swahili, Swati, Tswana, Umbundu, Xhosa,
Xitsonga, Yoruba, Zulu
- to/from French: Kinyarwanda, Lingala, Swahili, Wolof
- An additional 52 pairs within geographical/cultural clusters, to be selected based on
translator/annotator availability (specifics to be announced shortly):
- South/South East Africa: pairs among Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa,
Xitsonga, Zulu
- Horn of Africa and Central/East Africa: Amharic, Oromo, Somali, Swahili, Luo
- Nigeria and Gulf of Guinea: Nigerian Fulfulde, Hausa, Igbo, Yoruba
- Central Africa: Chichewa, Kinyarwanda, Lingala, Luganda, Swahili
Automatic Metrics: The systems will be evaluated on a suite of automatic metrics (a scoring sketch follows this list):
- Accuracy measures: BLEU, chrF++, and potentially a version of COMET tuned on African languages
- Fairness measures: measures of cross-lingual fairness (more details forthcoming)
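As a rough illustration of the accuracy measures, the sketch below computes corpus-level BLEU and chrF++ with the sacrebleu library. The hypothesis and reference strings are toy data, and the official scoring will run on the dedicated evaluation server rather than locally; COMET is omitted here.

# A minimal scoring sketch for BLEU and chrF++ using sacrebleu
# (pip install sacrebleu). Toy data; not the official evaluation.
import sacrebleu

hypotheses = ["The cat sat on the mat."]           # system outputs
references = [["The cat is sitting on the mat."]]  # list of reference streams,
                                                   # each aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
# chrF with word_order=2 is the chrF++ variant.
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrfpp.score:.2f}")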
Participants are encouraged but not required to handle all language pairs. Submissions dealing with only a subset
of pairs will be admissible.
More information on the evaluation dashboard and the human evaluation protocols will be released shortly.
Data track
The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or
multilingual datasets relevant to the training of MT models for this year’s set of languages.
Data track: Submissions
- Data must be submitted in its rawest form, with no pre-applied tokenization. Deadline: May 10
- Data has to be submitted through this form. The form requires a link to the hosted version of the data.
- License: we encourage data submissions with a permissive license (e.g. CC0) that allows participants
to use the data in their model training.
Data track: Evaluation
- There will not be an official evaluation metric for this track. Instead, we will document the data sources
according to their usage as reported by the main track participants.
- Datasets will be ranked based on how many groups used them to train systems in the evaluation: the
more participants use a dataset, the higher its contribution is ranked.
- As a measure of data usefulness, we will also report scores obtained by fine-tuning a pre-trained model on
these additional resources and evaluating BLEU against the FLORES-101 devtest set (see the sketch below).
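A minimal sketch of that usefulness probe, assuming a local FLORES-101 download and a hypothetical fine-tuned English-Swahili checkpoint (the checkpoint path and file layout are illustrative, not prescribed by the task):

# Score a (fine-tuned) model on the FLORES-101 devtest split with BLEU.
# FLORES-101 ships one plain-text file per language and split; the checkpoint
# path "./finetuned-eng-swh" is a hypothetical example. Multilingual
# checkpoints may additionally need src_lang/tgt_lang arguments.
import sacrebleu
from transformers import pipeline

translator = pipeline("translation", model="./finetuned-eng-swh", max_length=256)

with open("flores101_dataset/devtest/eng.devtest", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("flores101_dataset/devtest/swh.devtest", encoding="utf-8") as f:
    references = [line.strip() for line in f]

hypotheses = [out["translation_text"] for out in translator(sources)]
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"devtest BLEU: {bleu.score:.2f}")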
Data track: Paper submission
The Data track will require the submission of either an extended abstract (2-4 pages) or a paragraph describing the
dataset, together with a datasheet [templates [1] [2]]. The deadline for this submission is the same as for system description papers.
Compute grants
To facilitate work on low-resource translation, we encourage participants to apply for compute
grants so that GPU compute is less of a barrier to translation research. Further information on compute grants
and how to apply for them will be made available shortly.
Important dates
- Release of the sentence encoders for mining (Data track), April 1
- Data track submission deadline and model availability deadline, May 10
- Training data released, May 17
- Evaluation dashboard opens, Jun TBD
- Evaluation period ends, Jul TBD
- Paper submission deadline, Aug TBD
- Paper notification, Sep TBD
- Camera-ready version due, Oct TBD
- Conference (EMNLP), Dec TBD
Contact
Interested in the task? Please join the WMT Google group for
any further questions or comments.
Organizers
Antonios Anastasopoulos, George Mason University
Vukosi Marivate, University of Pretoria, Masakhane NLP, Deep Learning Indaba
David Adelani, Saarland University, Masakhane NLP
Marta R. Costa-jussà, Meta AI
Jean Maillard, Meta AI
Paco Guzmán, Meta AI
Holger Schwenk, Meta AI
Natalia Fedorova, Toloka