Shared Task: Large-Scale Machine Translation Evaluation for African Languages
Overview
Machine translation research has traditionally placed an outsized focus on a limited number of languages - mostly
belonging to the Indo-European family. Progress for many languages, some with millions of speakers, has been held
back by data scarcity issues. An inspiring recent trend has been the increased attention paid to low-resource
languages. However, these modelling efforts have been hindered by the lack of high-quality, standardised
evaluation benchmarks.
For the second edition of the Large-Scale MT shared task, we aim to bring together the community on the topic of
machine translation for a set of 24 African languages. We do so by introducing a high-quality benchmark, paired
with a fair and rigorous evaluation procedure.
Task Description
The shared task will consist of three tracks.
- The Data track focuses on the contribution of novel corpora. Participants may submit
monolingual, bilingual or multilingual datasets relevant to the training of MT models for this year’s set of
languages. Further information on this track is available below.
- Two translation tracks will evaluate the performance of translation models covering all of this year’s
languages. Translation will be evaluated to and from English and French, as well as between selected African
languages within particular geographical/cultural clusters:
- In the Constrained Translation track only the data listed on this page will be
allowed, including submissions accepted to the Data track (see the Resources section below).
The use of open-source pre-trained models will be permitted, provided that they are published before
the Data track submission deadline (see the usage sketch after this list).
- In the Unconstrained Translation track no restrictions will be made on the use of
data or models.
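For illustration, the sketch below translates a sentence with an open pre-trained model via Hugging Face Transformers. The choice of facebook/m2m100_418M and the English-to-Swahili direction are assumptions made purely for this example; whether a given model is eligible in the Constrained track depends on the rules above.

```python
# A minimal translation sketch with an open pre-trained model (assumption:
# facebook/m2m100_418M; any eligible model could be substituted).
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"  # source: English
inputs = tokenizer("Schools reopen on Monday.", return_tensors="pt")

# Force the decoder to start with the target-language token (Swahili).
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("sw"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```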
Resources
- To facilitate work in the Data track, we have released LASER sentence encoders supporting all the relevant
languages. LASER is a sentence representation toolkit which enables the fast mining of parallel corpora
(Heffernan et al., 2022). The encoders may be obtained here (see the mining sketch after this list).
- The validation and test data are based on FLORES-101, a high-quality benchmark which supports many-to-many evaluation in over
a hundred languages. We will be publishing supplements to FLORES which will complete its coverage of this
task’s set of languages. Validation and validation-test datasets will be provided, while the actual
evaluation will be performed on a dedicated server where participants will upload their models.
- Submissions in the Constrained Translation track are only allowed to use data from the
following sources. If you would like to use other sources, please submit to the Unconstrained
Translation track.
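As a rough illustration of how the encoders can be used for mining, the sketch below embeds candidate sentences on each side and keeps mutual nearest neighbours above a cosine-similarity threshold - a simplified stand-in for the margin-based mining of Heffernan et al. (2022). The community laserembeddings package and the example sentences are assumptions; the encoders released for this task are distributed separately.

```python
# A minimal mining sketch, not the official pipeline: embed both sides into
# LASER's shared multilingual space, then keep mutual nearest neighbours
# whose cosine similarity clears a threshold.
import numpy as np
from laserembeddings import Laser  # community package, used here for illustration

laser = Laser()
eng = ["Schools reopen on Monday.", "The harvest was good this year."]
swh = ["Mavuno yalikuwa mazuri mwaka huu.", "Shule zinafunguliwa tena Jumatatu."]

# Embed and L2-normalise so that dot products are cosine similarities.
src = laser.embed_sentences(eng, lang="en")
tgt = laser.embed_sentences(swh, lang="sw")
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

sim = src @ tgt.T                            # pairwise cosine similarities
fwd, bwd = sim.argmax(axis=1), sim.argmax(axis=0)
for i, j in enumerate(fwd):
    if bwd[j] == i and sim[i, j] > 0.8:      # mutual match above a threshold
        print(f"{sim[i, j]:.3f}\t{eng[i]}\t{swh[j]}")
```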
Full list of languages
Focus languages:
Afrikaans - afr | Amharic - amh | Chichewa - nya | Nigerian Fulfulde - fuv | Hausa - hau | Igbo - ibo |
Kamba - kam | Kinyarwanda - kin | Lingala - lin | Luganda - lug | Luo - luo | Northern Sotho - nso |
Oromo - orm | Shona - sna | Somali - som | Swahili - swh | Swati - ssw | Tswana - tsn |
Umbundu - umb | Wolof - wol | Xhosa - xho | Xitsonga - tso | Yoruba - yor | Zulu - zul
Colonial linguae francae: English - eng, French - fra
Evaluation
>>> Submission instructions and details HERE <<<
Due to computational and budgetary constraints, human evaluation will be conducted on a small set of
language pairs from the FLORES-101 dataset. You can download it using
this script. Specifically, we will evaluate on the following 100 language pairs:
- Translation between the focus languages and the pivots [48 pairs]:
- to/from English: Afrikaans, Amharic, Chichewa, Nigerian Fulfulde, Hausa, Igbo, Kamba, Kinyarwanda,
Luganda, Luo, Northern Sotho, Oromo, Shona, Somali, Swahili, Swati, Tswana, Umbundu, Xhosa,
Xitsonga, Yoruba, Zulu
- to/from French: Kinyarwanda, Lingala, Swahili, Wolof
- An additional 52 pairs within geographical/cultural clusters, to be selected based on
translator/annotator availability (specifics here):
- South/South East Africa: pairs among Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa,
Xitsonga, Zulu
- Horn of Africa and Central/East Africa: Amharic, Oromo, Somali, Swahili, Luo
- Nigeria and Gulf of Guinea: Nigerian Fulfulde, Hausa, Igbo, Yoruba
- Central Africa: Chichewa, Kinyarwanda, Lingala, Luganda, Swahili
Automatic Metrics: The systems will be evaluated on a suite of automatic metrics (see the scoring sketch below):
- Accuracy measures: BLEU, chrF++, and potentially a version of COMET tuned on African languages
- Fairness measures: measures of cross-lingual fairness (more details forthcoming)
Participants are encouraged, but not required, to handle all language pairs; submissions covering only a subset of pairs will be admissible.
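For reference, both accuracy measures are implemented in the sacrebleu package; the sketch below scores a hypothetical output file against a FLORES-101 devtest side. The file paths are assumptions based on the standard FLORES-101 layout.

```python
# A scoring sketch with sacrebleu; chrF++ is chrF computed with word_order=2.
import sacrebleu

# References: the devtest side of the target language (standard FLORES-101
# layout assumed: devtest/<lang>.devtest).
with open("flores101_dataset/devtest/hau.devtest", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# Hypotheses: one system translation per reference line (hypothetical path).
with open("outputs/eng-hau.hyp", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)  # chrF++
print(f"BLEU = {bleu.score:.2f}  chrF++ = {chrf.score:.2f}")
```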
Data track
The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or
multilingual datasets relevant to the training of MT models for this year’s set of languages.
Data track: Submissions
- Data must be submitted in its rawest form, with no tokenization applied. Deadline: May 10.
- Data must be submitted through this form. The form requires a link to the hosted version of the data.
- License: we encourage data submissions with a permissive license (e.g. CC0) that will allow participants
to use the data in their model training.
Data track: Evaluation
- There will not be an official evaluation metric for this track. Instead, we will document the data sources
according to their usage as reported by the main track participants.
- Datasets will be ranked based on how many groups have used them to train systems in the evaluation: the more
participants have used a dataset, the better ranked the contribution.
- As a measure of data usefulness, we will also report scores obtained by fine-tuning a pre-trained model on
these additional resources and evaluating BLEU against the FLORES-101 devtest (see the sketch after this list).
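The sketch below illustrates the general shape of such a usefulness check, under stated assumptions: M2M-100 stands in for the (unspecified) pre-trained model, and a toy sentence pair stands in for a contributed dataset. The fine-tuned model’s outputs would then be scored against the FLORES-101 devtest as in the scoring sketch above.

```python
# A compact fine-tuning sketch, not the organisers' exact setup.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang, tokenizer.tgt_lang = "en", "sw"  # illustrative pair

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
pairs = [("Schools reopen on Monday.", "Shule zinafunguliwa tena Jumatatu.")]

model.train()
for epoch in range(3):
    for src, tgt in pairs:
        batch = tokenizer(src, text_target=tgt, return_tensors="pt")
        loss = model(**batch).loss  # cross-entropy against the target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```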
Data track: Paper submission
The Data track will require either the submission of an extended abstract (2-4 pages) or a paragraph describing the
dataset, together with a datasheet (example templates: [1], [2]).
Participants who submit datasets should ensure that data is correctly credited by giving attribution not only to the data collectors
but also to the people from whom the data was originally collected.
The deadline for this submission is the same as for system description papers.
Compute grants
In order to facilitate work on low-resource translation and mitigate the cost of training and/or
fine-tuning large models, we will be able to provide Microsoft Azure credits so that GPU compute is less of a barrier to translation research.
To apply for credits please fill in this brief form.
Important dates
- Release of the encoders for mining for the Data track, April 1
- Data track submission deadline and model availability deadline, May 10
- Training data released, May 17
- Evaluation opens, Jul 13 (postponed from Jun 28)
- Evaluation period ends, Sep 5 (extended from Jul 28)
- Paper submission deadline, Sep 26 (extended from Sep 7)
- Paper notification, Oct 8
- Camera-ready version due, Oct 16
- Conference (EMNLP), Dec 7-8
Contact
Interested in the task? Please join the WMT Google group for any further questions or comments.
Organizers
Antonios Anastasopoulos, George Mason University
Vukosi Marivate, University of Pretoria, Masakhane NLP, Deep Learning Indaba
David Adelani, Saarland University, Masakhane NLP
Marta R. Costa-jussà, Meta AI
Paco Guzmán, Meta AI
Jean Maillard, Meta AI
Safiyyah Saleem, Meta AI
Holger Schwenk, Meta AI
Natalia Fedorova, Toloka AI
Sergey Koshelev, Toloka AI
Akshita Bhagia, AI2
Jesse Dodge, AI2
Md Mahfuz ibn Alam, George Mason University
Jonathan Mbuya, George Mason University
Fahim Faisal, George Mason University