Shared Task: Large-Scale Machine Translation Evaluation for African Languages
Overview
Machine translation research has traditionally placed an outsized focus on a limited number of languages, mostly belonging to the Indo-European family. Progress for many languages, some with millions of speakers, has been held back by data scarcity. An inspiring recent trend has been the increased attention paid to low-resource languages. However, these modelling efforts have been hindered by the lack of high-quality, standardised evaluation benchmarks.
For the second edition of the Large-Scale MT shared task, we aim to bring the community together around machine translation for a set of 24 African languages. We do so by introducing a high-quality benchmark, paired with a fair and rigorous evaluation procedure.
Task Description
The shared task will consist of three tracks.
- The Data track focuses on the contribution of novel corpora. Participants may submit
monolingual, bilingual or multilingual datasets relevant to the training of MT models for this year’s set of
languages. Further information on this track is available below.
- Two translation tracks will evaluate the performance of translation models covering all of this year’s languages. Translation will be evaluated to and from English and French, as well as between select African languages within particular geographical/cultural clusters:
- In the Constrained Translation track, only the data listed on this page will be allowed, including submissions accepted to the Data track. The use of open-source pre-trained models will be permitted, provided that they are published before the Data track submission deadline.
- In the Unconstrained Translation track, no restrictions will be placed on the use of data or models.
Resources
- To facilitate work in the Data track, we have released LASER sentence encoders supporting all the relevant languages. LASER is a sentence representation toolkit which enables the fast mining of parallel corpora; a minimal mining sketch follows this list. The encoders may be obtained here.
- The validation and test data are based on FLORES-101, a high-quality benchmark which supports many-to-many evaluation in over a hundred languages. We will be publishing supplements to FLORES-101 which will complete its coverage of this task’s set of languages. Validation and validation-test datasets will be provided, while the actual evaluation will be performed on a dedicated server to which participants will upload their models.
- For all languages, we will be providing parallel corpora mined from crawled data. Additional training data,
covering a subset of languages, are available from the
public OPUS repository.
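As an illustration of how sentence encoders of this kind can be used for mining, the sketch below scores candidate sentence pairs by cosine similarity in LASER’s shared multilingual embedding space. It uses the third-party laserembeddings package rather than the encoders released for this task (whose interface may differ), and the sentences, language codes, and similarity threshold are all illustrative assumptions, not part of the official pipeline.

# A minimal bitext-mining sketch over LASER's shared multilingual space.
# Assumes the third-party `laserembeddings` package (pip install laserembeddings,
# then `python -m laserembeddings download-models`). Toy data throughout.
import numpy as np
from laserembeddings import Laser

laser = Laser()

# Toy monolingual sentences; real mining would use crawled corpora.
eng_sents = ["The weather is nice today.", "Schools will reopen on Monday."]
swh_sents = ["Shule zitafunguliwa Jumatatu.", "Hali ya hewa ni nzuri leo."]

# Embed each side; LASER maps all languages into one vector space.
eng_emb = laser.embed_sentences(eng_sents, lang="en")
swh_emb = laser.embed_sentences(swh_sents, lang="sw")

# L2-normalise so that dot products are cosine similarities.
eng_emb /= np.linalg.norm(eng_emb, axis=1, keepdims=True)
swh_emb /= np.linalg.norm(swh_emb, axis=1, keepdims=True)
sims = eng_emb @ swh_emb.T

# Keep each English sentence's best match above a hypothetical threshold.
THRESHOLD = 0.8
for i, row in enumerate(sims):
    j = int(row.argmax())
    if row[j] >= THRESHOLD:
        print(f"{eng_sents[i]}\t{swh_sents[j]}\t{row[j]:.3f}")

Production mining pipelines typically replace the raw cosine threshold with margin-based scoring over nearest neighbours, but the underlying principle, comparable sentence vectors across languages, is the same.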
Full list of languages
Focus languages:
Afrikaans - afr | Amharic - amh | Chichewa - nya | Hausa - hau | Igbo - ibo | Kamba - kam | Kinyarwanda - kin | Lingala - lin | Luganda - lug | Luo - luo | Nigerian Fulfulde - fuv | Northern Sotho - nso | Oromo - orm | Shona - sna | Somali - som | Swahili - swh | Swati - ssw | Tswana - tsn | Umbundu - umb | Wolof - wol | Xhosa - xho | Xitsonga - tso | Yoruba - yor | Zulu - zul
Colonial linguae francae: English - eng, French - fra
Evaluation
Due to computational and budgetary constraints, human evaluation will be conducted on a small set of
language pairs. Specifically, we will evaluate the following 100 language pairs:
- Translation between the focus languages and the pivots [48 pairs]:
- to/from English: Afrikaans, Amharic, Chichewa, Nigerian Fulfulde, Hausa, Igbo, Kamba, Kinyarwanda,
Luganda, Luo, Northern Sotho, Oromo, Shona, Somali, Swahili, Swati, Tswana, Umbundu, Xhosa,
Xitsonga, Yoruba, Zulu
- to/from French: Kinyarwanda, Lingala, Swahili, Wolof
- An additional 52 pairs within geographical/cultural clusters, to be selected based on
translator/annotator availability (specifics to be announced shortly):
- South/South East Africa: pairs among Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa,
Xitsonga, Zulu
- Horn of Africa and Central/East Africa: Amharic, Oromo, Somali, Swahili, Luo
- Nigeria and Gulf of Guinea: Nigerian Fulfulde, Hausa, Igbo, Yoruba
- Central Africa: Chichewa, Kinyarwanda, Lingala, Luganda, Swahili
Automatic Metrics: The systems will be evaluated on a suite of automatic metrics (a scoring sketch follows this list):
- Accuracy measures: BLEU, chrF++, and potentially a version of COMET tuned on African languages
- Fairness measures: measures of cross-lingual fairness (more details forthcoming)
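As a rough illustration of the accuracy measures, the sketch below computes corpus-level BLEU and chrF++ with the sacrebleu library. The hypothesis and reference strings are toy data, and the official scoring will run on the dedicated evaluation server rather than locally; COMET is omitted here.

# A minimal scoring sketch for BLEU and chrF++ using sacrebleu
# (pip install sacrebleu). Toy data; not the official evaluation.
import sacrebleu

hypotheses = ["The cat sat on the mat."]           # system outputs
references = [["The cat is sitting on the mat."]]  # list of reference streams,
                                                   # each aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
# chrF with word_order=2 is the chrF++ variant.
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrfpp.score:.2f}")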
Participants are encouraged but not required to handle all language pairs. Submissions dealing with only a subset
of pairs will be admissible.
More information on the evaluation dashboard and the human evaluation protocols will be released shortly.
Data track
The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or
multilingual datasets relevant to the training of MT models for this year’s set of languages.
Data track: Submissions
- Data must be submitted in its rawest form, with no pre-applied tokenization. Deadline: May 10
- Data has to be submitted through this form. The form requires a link to the hosted version of the data.
- License: we encourage data submissions with a permissive license (e.g. CC0) that allows participants
to use the data in their model training.
Data track: Evaluation
- There will not be an official evaluation metric for this track. Instead, we will document the data sources
according to their usage as reported by the main track participants.
- Datasets will be ranked based on how many groups used them to train systems in the evaluation: the
more participants use a dataset, the higher its contribution is ranked.
- As a measure of data usefulness, we will also report scores obtained by fine-tuning a pre-trained model on
these additional resources and evaluating BLEU against the FLORES-101 devtest set (see the sketch below).
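A minimal sketch of that usefulness probe, assuming a local FLORES-101 download and a hypothetical fine-tuned English-Swahili checkpoint (the checkpoint path and file layout are illustrative, not prescribed by the task):

# Score a (fine-tuned) model on the FLORES-101 devtest split with BLEU.
# FLORES-101 ships one plain-text file per language and split; the checkpoint
# path "./finetuned-eng-swh" is a hypothetical example. Multilingual
# checkpoints may additionally need src_lang/tgt_lang arguments.
import sacrebleu
from transformers import pipeline

translator = pipeline("translation", model="./finetuned-eng-swh", max_length=256)

with open("flores101_dataset/devtest/eng.devtest", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("flores101_dataset/devtest/swh.devtest", encoding="utf-8") as f:
    references = [line.strip() for line in f]

hypotheses = [out["translation_text"] for out in translator(sources)]
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"devtest BLEU: {bleu.score:.2f}")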
Data track: Paper submission
The Data track will require the submission of either an extended abstract (2-4 pages) or a paragraph describing the
dataset, together with a datasheet [templates [1] [2]]. The deadline for this submission is the same as for system description papers.
Compute grants
To facilitate work on low-resource translation, we encourage participants to apply for compute
grants so that GPU compute is less of a barrier to translation research. Further information on compute grants
and how to apply for them will be made available shortly.
Important dates
- Release of the sentence encoders for mining (Data track), April 1
- Data track submission deadline and model availability deadline, May 10
- Training data released, May 17
- Evaluation dashboard opens, Jun TBD
- Evaluation period ends, Jul TBD
- Paper submission deadline, Aug TBD
- Paper notification, Sep TBD
- Camera-ready version due, Oct TBD
- Conference (EMNLP), Dec TBD
Contact
Interested in the task? Please join the WMT Google group for
any further questions or comments.
Organizers
Antonios Anastasopoulos, George Mason University
Vukosi Marivate, University of Pretoria, Masakhane NLP, Deep Learning Indaba
David Adelani, Saarland University, Masakhane NLP
Marta R. Costa-jussà, Meta AI
Jean Maillard, Meta AI
Paco Guzmán, Meta AI
Holger Schwenk, Meta AI
Natalia Fedorova, Toloka