Shared Task: Multilingual Low-Resource Translation for Indo-European Languages



PAGE IN PROGRESS, participants can suggest/provide new training data till May 19th, STAY TUNED!!

TASK DESCRIPTION

Massively multilingual machine translation has shown impressive capabilities, including zero- and few-shot translation of low-resource languages. However, these models are often evaluated from or into English, where the most data is available, on the assumption that they generalise to other pairs and to low-resource languages.

In the first edition of the Multilingual Low-Resource Translation shared task, we focus on multilinguality in the cultural heritage domain for two Indo-European language families: North-Germanic and Romance. We want to explore how information in one language can be transferred to other related languages. To this end, we evaluate translation quality on low-resource language pairs, but explicitly encourage the use of data from the high-resource language pairs in the same family. We would like to answer the question: do we need English and/or Spanish for high-quality translation of related languages? If so, what is the best way to combine the data: pivot, cascade, multilingual, via pre-training? To explore the topic, we present two tasks with different characteristics, one per language family (see below). Although data in the highest-resourced languages is allowed and encouraged for training, translation quality will be evaluated only on the other pairs. Participants are also encouraged to make any additional resources useful for the task available to the community.
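As an illustration of one of the options named above, pivoting can be sketched as two chained translation calls through a high-resource language. This is only a minimal sketch: the `translate(text, src, tgt)` callable, the `toy_translate` stand-in and the choice of English as pivot are assumptions made for the example, not part of the task definition.

```python
from typing import Callable

# Hypothetical interface: translate(text, src, tgt) -> translated text.
Translator = Callable[[str, str, str], str]


def pivot_translate(text: str, src: str, tgt: str, pivot: str,
                    translate: Translator) -> str:
    """Translate src -> pivot -> tgt by chaining two calls to one system."""
    if pivot in (src, tgt):
        # No detour needed when the pivot is already one end of the pair.
        return translate(text, src, tgt)
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, tgt)


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for a real MT system: only records the direction used."""
    return f"[{src}>{tgt}] {text}"


print(pivot_translate("hús", "is", "sv", "en", toy_translate))
# prints: [en>sv] [is>en] hús
```

A multilingual system would instead handle the source-target pair in a single call, which is the kind of trade-off the task is meant to explore.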

Subtasks

DATA

Monolingual, Parallel and Multilingual Data

Additional Resources

Given the importance of named entities in the cultural heritage domain, we provide participants with parallel/multilingual lexicons from Wiktionary, Wikidata and Wikipedia titles. Participants can also build their own resources from Wikimedia; we have simply extracted the data for the shared task languages to facilitate participation.

Columns: North-Germanic (is-nb-sv, bilinguals) | Romance (ca-it-oc-ro, ca-it-ro, bilinguals) | Info
Wikidata all cleaner all ---- all cleaner all cleaner all ---- README
Wiktionary
Wikipedia Titles all cleaner all cleaner all ---- README

Validation Sets

Send us a brief message stating your interest to get the password and download the validation sets.

Test Sets

Come back on June 29, 2021!

EVALUATION

Task 1. Europeana thesis abstracts and descriptions. We will provide a test set with a similar number of abstracts in each of the 3 languages, each of which has to be translated into the other 2. The data will be in XML format, annotated with the source language and document boundaries. The validation set has exactly the same structure. All data has been translated by professional translators, with the source language being the original language of the text.

Task 2. Wikipedia cultural heritage articles. We will provide a test set with articles in Catalan that need to be translated into the other 3 languages. As in Task 1, the data will be in XML format, annotated with the source language and document boundaries. The validation set has exactly the same structure. All data has been translated by professional translators, with the source language being the original language of the text.
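Since both test sets carry source-language and document-boundary information in XML, a reader for such files might look like the sketch below. The exact schema is not specified on this page, so the element names `dataset`, `doc` and `seg`, the attributes `id` and `origlang`, and the sample sentences are all assumptions made for illustration; adapt them to the released validation files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may use different tags and attributes.
SAMPLE = """<dataset>
  <doc id="d1" origlang="ca">
    <seg id="1">Primera frase.</seg>
    <seg id="2">Segona frase.</seg>
  </doc>
  <doc id="d2" origlang="ca">
    <seg id="1">Una altra frase.</seg>
  </doc>
</dataset>"""


def read_documents(xml_text):
    """Return one dict per document, keeping segments grouped by document."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        segments = [(seg.text or "").strip() for seg in doc.iter("seg")]
        docs.append({
            "id": doc.get("id"),
            "origlang": doc.get("origlang"),
            "segments": segments,
        })
    return docs


docs = read_documents(SAMPLE)
for d in docs:
    print(d["id"], d["origlang"], len(d["segments"]))
```

Keeping segments grouped per document makes it straightforward to translate with document-level context and to write the output back with the same boundaries.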

Automatic evaluation results will be reported per language pair and on average. The final ranking will be based on the average translation quality per subtask, that is, per family and not per pair. When possible, we will also perform human evaluation at document level on a subset of paragraphs/sentences. If you plan to participate, please contact us so that we can plan the human evaluation.
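The per-family ranking described above can be sketched as a plain average over language pairs. The metric is not named here, so the sketch assumes a single score per pair where higher is better; the system names and score values are invented placeholders, not results.

```python
from statistics import mean


def subtask_average(scores_by_pair):
    """Unweighted mean of the per-pair scores for one subtask."""
    return mean(scores_by_pair.values())


def rank_systems(results):
    """Sort systems by their subtask average, best first (higher is better)."""
    averages = {name: subtask_average(pairs) for name, pairs in results.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores for illustration only.
results = {
    "system_a": {"ca-it": 30.0, "ca-oc": 20.0, "ca-ro": 25.0},
    "system_b": {"ca-it": 28.0, "ca-oc": 26.0, "ca-ro": 27.0},
}

for name, average in rank_systems(results):
    print(name, average)
```

In this toy example `system_a` is best on one pair but `system_b` ranks first overall, which shows why a system tuned for a single pair is not favoured by a per-family ranking.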

Since the official evaluation will be done per family, participants may take part in only one of the tasks if they prefer, but they are expected to submit translations for all the languages involved in that task.
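To make the expected coverage concrete, the translation directions for each task can be enumerated as below. The language codes are inferred from the lexicon table and the acknowledgements on this page (is, nb, sv for North-Germanic; ca, it, oc, ro for Romance) and should be treated as an assumption until the official submission instructions are published.

```python
from itertools import permutations

# Codes inferred from this page, not from an official specification.
NORTH_GERMANIC = ["is", "nb", "sv"]
ROMANCE_SOURCE = "ca"
ROMANCE_TARGETS = ["it", "oc", "ro"]


def task1_directions():
    """Every ordered pair of distinct North-Germanic languages."""
    return list(permutations(NORTH_GERMANIC, 2))


def task2_directions():
    """Catalan into each of the other Romance languages."""
    return [(ROMANCE_SOURCE, tgt) for tgt in ROMANCE_TARGETS]


print(len(task1_directions()), task1_directions())
print(len(task2_directions()), task2_directions())
```

Under these assumptions Task 1 covers 6 directions and Task 2 covers 3.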

SUBMISSION

Coming soon!

IMPORTANT DATES

Release of initial training data April 5, 2021
Additional training data deadline May 19, 2021
Release of test data June 29, 2021
Results submission deadline July 6, 2021
Paper submission deadline August 5, 2021
Paper notification September 5, 2021
Camera-ready version due September 15, 2021
Conference at EMNLP November 10-11, 2021

ORGANISERS

Anastasija Amann (DFKI GmbH, Germany)
Kwabena Amponsah-Kaakyire (DFKI GmbH, Germany)
Cristina España-Bonet (DFKI GmbH, Germany)
Josef van Genabith (DFKI GmbH, Germany)
Leonie Harter (DFKI GmbH, Germany)

Contact

Interested in the task? Join the WMT Google group for questions or comments and/or email cristinae aatt dfki.de.

ACKNOWLEDGEMENTS

This shared task is funded by the European Language Resource Coordination ELRC (SMART 2019/1083) and LT-BRIDGE (H2020, 952194), and supported by the Directorate-General for Language Policy, Ministry of Culture, Government of Catalonia. We are thankful to Europeana for providing source texts in Icelandic, Norwegian and Swedish.