Multilingual Low-Resource Translation for Indo-European Languages - EMNLP Sixth Conference on Machine Translation

Shared Task: Multilingual Low-Resource Translation for Indo-European Languages

The automatic evaluation results are now available!

TASK DESCRIPTION

Massively multilingual machine translation has shown impressive capabilities, including zero-shot and few-shot translation of low-resource languages. However, these models are often evaluated only from or into English, where the most data is available, on the assumption that the results generalise to other pairs and to low-resource languages.

In the first edition of the Multilingual Low-Resource Translation shared task, we focus on multilinguality in the cultural heritage domain for two Indo-European language families: North-Germanic and Romance. We want to explore how information in one language can be transferred to other related languages. To this end, we evaluate translation quality on low-resource language pairs, but explicitly encourage the use of data from the high-resource language pairs in the same family. We would like to answer the question: do we need English and/or Spanish for high-quality translation of related languages? If so, what is the best way to combine the data: pivot, cascade, multilingual, via pre-training? To explore the topic, we present two tasks with different characteristics, one per language family (see below). Although data in the highest-resourced languages is allowed and encouraged for training, translation quality will be evaluated only on the other pairs. Participants are also encouraged to make available to the community any additional resources useful for the task.

Subtasks

Task 1. Europeana thesis abstracts and descriptions. North-Germanic languages: from/to Icelandic, Norwegian Bokmål and Swedish. Danish, German and English data is allowed for training but translation is not evaluated.

Task 2. Wikipedia cultural heritage articles. Romance languages: from Catalan to Occitan, Romanian and Italian. Spanish, French and Portuguese data (+ English!) is allowed for training but translation is not evaluated.

DATA

Monolingual, Parallel and Multilingual Data

Corpora available at ELRC for the languages of the task (and the closely related languages stated above)
This data includes ParaCrawl and Global Voices.
Europarl, JW300, WikiMatrix, MultiCCAligned, OPUS-100, Books, the Bible and TED talks
Common Crawl, Wikipedia and Wikidata dumps
Wordnets with open license, BabelNet
(Multilingual) pre-trained embeddings or other models that are freely available online. See Hugging Face resources
Additional resources below

Additional Resources

Given the importance of named entities in the cultural heritage domain, we provide participants with parallel/multilingual lexicons from Wiktionary, Wikidata and Wikipedia titles. Participants can also create their own resources from Wikimedia; we have simply extracted the data for the shared task languages to facilitate participation.

	North-Germanic		Romance			Info
	is-nb-sv	bilinguals	ca-it-oc-ro	ca-it-ro	bilinguals
Wikidata	all cleaner	all	all cleaner	all cleaner	all	README
Wiktionary	----	all	----	----	all	README
Wikipedia Titles	----	----	all cleaner	all cleaner	all	README

Validation Sets

To get the password and download the validation sets, send us a brief message stating your interest.

Test Sets

29.06.2021 The test set is now available!
See submission details below and don't forget to fill in the online form when submitting your systems.

13.07.2021 The full test set (source+translations) is available.
Note for Task 1 (Europeana, North-Germanic family) participants: the test set europeana.test.sv.xml used in the evaluation accidentally contained an additional line break at line 618 ("I min uppsats har jag valt att undersöka hur och i vilket syfte pedagoger använder sig av musik i förskol \\ an." should be a single line, with "förskolan." as a single token). This has been taken into account in the evaluation. The full test set corrects this mistake.

EVALUATION

Task 1. Europeana thesis abstracts and descriptions. We will provide a test set with a similar number of abstracts in each of the 3 languages; each abstract has to be translated into the other 2 languages. The data will be in xml format with information on the source language and document boundaries. The validation set has exactly the same structure. All data has been translated by professional translators, and the source language is always the original language of the text.

Task 2. Wikipedia cultural heritage articles. We will provide a test set with articles in Catalan that need to be translated into the other 3 languages. As in Task 1, the data will be in xml format with information on the source language and document boundaries. The validation set has exactly the same structure. All data has been translated by professional translators, and the source language is always the original language of the text.
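
As a quick illustration of which translation directions the two task descriptions above imply, here is a minimal Python sketch. The two-letter codes are taken from the resource table and the file-name examples on this page (is, nb, sv; ca, it, oc, ro). Whether Norwegian Bokmål appears as "nb" in the actual test files is an assumption, and the function names are purely illustrative.

```python
from itertools import permutations

# Language codes as they appear in the resource table on this page.
# Using "nb" for Norwegian Bokmål in direction names is an assumption.
GERMANIC = ["is", "nb", "sv"]
ROMANCE_SOURCE = "ca"
ROMANCE_TARGETS = ["oc", "ro", "it"]


def germanic_directions():
    """Task 1: every ordered pair of distinct North-Germanic task languages."""
    return [f"{src}2{tgt}" for src, tgt in permutations(GERMANIC, 2)]


def romance_directions():
    """Task 2: Catalan into each of the three Romance target languages."""
    return [f"{ROMANCE_SOURCE}2{tgt}" for tgt in ROMANCE_TARGETS]


if __name__ == "__main__":
    print(germanic_directions())  # six directions
    print(romance_directions())   # three directions
```

Under these assumptions Task 1 has six evaluated directions and Task 2 has three.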

Automatic evaluation results will be reported per language pair and on average. The final ranking will be based on the average translation quality per subtask, that is, per family and not per pair. When possible, we will also perform human evaluation at document level on a subset of paragraphs/sentences. If you plan to participate, please contact us so that we can plan the human evaluation. If you are a native speaker of any of the target languages involved and are interested in the evaluation, please contact us as well!
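
To make the per-family ranking concrete, the following Python sketch averages per-pair scores and sorts systems by that average. It is only an illustration: this page does not say which automatic metrics are used or how they are combined, so the sketch assumes a single metric where higher is better and an unweighted mean over pairs. The system names and scores are invented.

```python
from statistics import mean


def family_average(scores_by_pair):
    """Unweighted mean of the per-pair scores for one subtask (family)."""
    return mean(scores_by_pair.values())


def rank_systems(results):
    """Return system names sorted by family average, best first.

    Assumes a metric where a higher score means better translations.
    """
    return sorted(results, key=lambda name: family_average(results[name]),
                  reverse=True)


# Invented scores for two hypothetical systems on the Romance subtask.
example = {
    "systemA": {"ca2it": 40.0, "ca2oc": 60.0, "ca2ro": 20.0},
    "systemB": {"ca2it": 45.0, "ca2oc": 50.0, "ca2ro": 28.0},
}

if __name__ == "__main__":
    for name in rank_systems(example):
        print(name, family_average(example[name]))
```

In this made-up example systemA wins on one pair but systemB has the higher family average, so systemB would rank first.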

Since the official evaluation will be done per family, participants can participate in only one of the tasks if preferred, but they are expected to submit translations for all the languages involved in the task.

SUBMISSION

Test sets will be released in the same format as the validation sets and must be returned in that same format, with the addition of a tag that includes the system name, e.g. sysid="myTeamPrimary". Please use this same name as the first part of the file name for your submission (see below).
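
One possible way to add the system name is sketched below in Python, using the standard library XML parser. This page does not specify which element carries the sysid attribute, so the element name "doc" and the miniature sample file are assumptions made for illustration only; check the released validation files for the real structure before using anything like this.

```python
import xml.etree.ElementTree as ET


def add_sysid(xml_text, system_name, element="doc"):
    """Return xml_text with sysid=system_name set on every `element` node.

    The default element name "doc" is a guess, not the documented schema.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter(element):
        node.set("sysid", system_name)
    return ET.tostring(root, encoding="unicode")


# Invented miniature file, NOT the real test-set schema.
sample = '<dataset><doc id="1"><p>Hej</p></doc></dataset>'

if __name__ == "__main__":
    print(add_sysid(sample, "myTeamPrimary"))
```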

Two submissions per group and task are allowed: a primary and a contrastive system. Participants will have to fill in a form at submission time describing the main characteristics of each system.

File naming. Name your files with your system name (including type of submission primary vs. contrastive), language family and translation direction. Examples:

myTeamPrimary.romance.ca2it.xml
myTeamContrastive.romance.ca2it.xml
myTeamPrimary.germanic.is2sv.xml
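
The naming pattern in the examples above can be generated and checked with a few lines of Python. This is a sketch that infers the pattern from the three examples (system name, family, source2target, .xml). The helper names are hypothetical, and the restriction to lower-case letters for language codes is an assumption.

```python
import re

FAMILIES = {"romance", "germanic"}

# Pattern inferred from the three example file names above.
PATTERN = re.compile(
    r"^(?P<system>[^.]+)\.(?P<family>romance|germanic)\."
    r"(?P<src>[a-z]+)2(?P<tgt>[a-z]+)\.xml$"
)


def submission_filename(system, family, src, tgt):
    """Build a file name such as myTeamPrimary.romance.ca2it.xml."""
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family!r}")
    return f"{system}.{family}.{src}2{tgt}.xml"


def parse_submission_filename(name):
    """Split a file name into its parts, or raise ValueError if malformed."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"file name does not follow the pattern: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(submission_filename("myTeamPrimary", "germanic", "is", "sv"))
    print(parse_submission_filename("myTeamContrastive.romance.ca2it.xml"))
```

Running a check like this over all files before sending them helps catch a missing family or a swapped direction early.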

Send your translations by email to cristinae aatt dfki.de and fill in this short online form for each system (one for each combination of Romance/Germanic and Primary/Contrastive). You will receive a confirmation email within a few hours.

SYSTEM PAPER

Follow the guidelines on the WMT main page for the system description paper. In addition, we will use the information you give in the online form for the task description paper.

IMPORTANT DATES

Release of initial training data	April 5, 2021
Additional training data deadline	May 19, 2021
Release of test data	June 29, 2021
Results submission deadline	July 6, 2021
Paper submission deadline	August 5, 2021
Paper notification	September 5, 2021
Camera-ready version due	September 15, 2021
Conference at EMNLP	November 10-11, 2021

ORGANISERS

Anastasija Amann (DFKI GmbH, Germany)
Kwabena Amponsah-Kaakyire (DFKI GmbH, Germany)
Cristina España-Bonet (DFKI GmbH, Germany)
Josef van Genabith (DFKI GmbH, Germany)
Leonie Harter (DFKI GmbH, Germany)

Contact

Interested in the task? Join the WMT Google group for questions or comments and/or email cristinae aatt dfki.de.

ACKNOWLEDGEMENTS

This shared task is funded by the European Language Resource Coordination ELRC (SMART 2019/1083) and LT-BRIDGE (H2020, 952194), and supported by the Directorate-General for Language Policy, Ministry of Culture, Government of Catalonia. We are thankful to Europeana for providing source texts in Icelandic, Norwegian and Swedish.