Shared Task: Machine Translation using Terminologies

Language domains that require very careful use of terminology are abundant. The need to translate adequately within such domains is undeniable, as shown, for example, by the various WMT shared tasks on biomedical translation.

More interestingly, as the abundance of research on domain adaptation shows, such language domains are (a) not adequately covered by existing data and models, while (b) new (or “surge”) domains arise and models need to be adapted, often with significant downstream implications: consider the new COVID-19 domain and the large efforts for translation of critical information regarding pandemic handling and infection prevention strategies.

In the case of newly developed domains, while parallel data are hard to come by, it is fairly straightforward to create word- or phrase-level terminologies, which can be used to guide professional translators and ensure both accuracy and consistency.

This shared task replicates such a scenario and invites participants to explore methods for incorporating terminologies into either the training or the inference process, in order to improve both the accuracy and the consistency of MT systems on a new domain.

IMPORTANT DATES

Release of training data and terminologies: April 2021
Surprise languages announced: June 28, 2021
Test set available: July 22, 2021 (originally July 19, 2021)
Submission of translations: July 29, 2021 (originally July 23, 2021)
System descriptions due: August 5, 2021
Camera-ready for system descriptions: September 15, 2021
Conference in Punta Cana: November 10-11, 2021

SETTINGS

In this shared task, we will distinguish submissions that use the terminology only at inference time (e.g., for constrained decoding or similar techniques) from submissions that use the terminology at training time (e.g., for data selection, data augmentation, explicit training, etc.). Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.
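As an illustration of a training-time approach, one widely studied technique is to inline the desired target term after each matched source term in the training data, so the model learns to copy terminology through. The sketch below is our own minimal illustration, not a method prescribed by the task; the `<trans>`/`<eot>` markers and the function name are made up for exposition.

```python
import re

def annotate_source(sentence, terminology):
    """Inline the target term after each matched source term, e.g.
    "acute bronchitis" -> "acute bronchitis <trans> bronchite aiguë <eot>".
    `terminology` maps source strings to target strings."""
    # One regex pass with longest alternatives first, so each source span
    # is annotated at most once and substrings of longer terms are skipped.
    pattern = "|".join(
        re.escape(t) for t in sorted(terminology, key=len, reverse=True))
    return re.sub(
        pattern,
        lambda m: f"{m.group(0)} <trans> {terminology[m.group(0)]} <eot>",
        sentence)
```

In this setup, the annotated source side would be paired with the unchanged target side during training, and the same annotation would be applied to inputs at test time.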

We clarify that the blind test sets will not include any terminology annotations. It will fall on the participants' systems/pipelines to identify the source terms given the terminologies.
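A simple way to perform this identification is substring matching against the terminology, longest entries first. The sketch below is a naive illustration of ours (case-insensitive exact matching only); real pipelines would likely add the lemmatizers or morphological analyzers the task allows.

```python
def find_terms(sentence, terminology):
    """Return (source_term, target_term) pairs for every terminology
    entry whose source string occurs in the sentence, longest first."""
    lowered = sentence.lower()
    return [(src, tgt)
            for src, tgt in sorted(terminology.items(),
                                   key=lambda kv: len(kv[0]), reverse=True)
            if src.lower() in lowered]
```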

The submission report should highlight how participants' methods and data differ from the standard MT approach, and should make clear which tools and which training sets were used.

LANGUAGE PAIRS

The shared task will focus on four language pairs. We will provide training/development data and terminologies for these language pairs; test sets will be released at the beginning of the evaluation period. The goal of this setting (with both development and surprise language pairs) is to discourage approaches that overfit to a particular language selection, and instead to evaluate the more realistic scenario of having to tackle the new domain in a new language in a limited amount of time. The surprise language pairs will be announced 3 weeks before the start of the evaluation campaign, at which point we will also provide training data and terminologies for them.

You may participate in any or all of the language pairs.

DATA

Participants may use any parallel or monolingual data listed in previous WMT shared tasks to train their systems, and only such data (the shared task allows only constrained submissions). The pre-trained models listed here (mBERT, XLM, XLM-R, mBART, mT5, M2M100) are also allowed, but their use should be disclosed by the participants. The training data for English-Chinese matches the data condition of the WMT 2021 News translation task.

Parallel data

Europarl v10: Now with metadata; text is unchanged. The cs-de file is from v8.
ParaCrawl v7.1: Updated for 2021. Further details on ParaCrawl, including tmx files, are available at the ParaCrawl website. The zh-en ParaCrawl is a "bonus release". Metadata (tmx files) may be used.
News Commentary v16: Updated for 2021.
Wiki Titles v3: Updated for 2021.
UN Parallel Corpus V1.0: Register and download.
CCMT Corpus: Register and download. Same as the CWMT corpus from last year, under a new host.
WikiMatrix: We release the official version, with added language identification (from cld2).
Back-translated news: The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018.
Common Crawl corpus: From 2013. Used in the WMT 2015 News Translation Task.
10^9 French-English corpus: From 2010. Used in the WMT 2015 News Translation Task.
Yandex Corpus
DGT v2019
CC-Aligned
MultiCC-Aligned

Monolingual Data

News crawl: Updated. Large corpora of crawled news, collected since 2007. Versions up to 2019 are as before.
News Commentary: Updated. Monolingual text from the news-commentary crawl; a superset of the parallel version. Use v16.
Extended Common Crawl: Extracted from crawls up to April 2020.

Note: You may not use the TICO-19 dataset or the EMEA corpus, as they will be part of the evaluation suite.

TERMINOLOGIES

We will provide terminologies for all language pairs. The terminologies will be provided as a simple tab-separated-values (.tsv) file with the following format:

Id   sourceLang   targetLang   sourceString                 targetString
1    en           fr           1918 flu                     Grippe de 1918
2    en           fr           acute bronchitis             bronchite aiguë
3    en           fr           acute respiratory disease    maladie respiratoire aiguë
4    en           fr           AIDS                         SIDA
5    en           fr           airborne droplets            Gouttelettes en suspension dans l'air

Download Terminologies here! [Updated: June 29][Updated to include Korean and minor revisions in Russian: July 21]

DEVELOPMENT AND TEST SET FORMAT

[June 9]: Development sets are available! You can download them here.
[June 29]: Development sets for EN-RU and CS-DE are available! You can download them here. The EN-KO development sets will be available in two weeks.
[July 21]: Development sets for EN-KO are available! You can download all development sets here.
[Update July 21: BLIND TEST SETS for all languages are available!] You can download them here! The development and test sets follow the same XML schema as used in the WMT news translation shared tasks.

The script wrap-xml.perl makes it easy to convert an output file in one-segment-per-line format into the required SGML file:

Format: wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT
Example: wrap-xml.perl en terminology-test-2021.src.en.sgm Google < decoder-output > decoder-output.sgm

The XML schema will additionally mark the terms in the source and target sentences that will be used for terminology-targeted evaluation. One sentence can include multiple terms.
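For illustration only, a term-annotated segment could look like the sketch below. The tag and attribute names here are assumptions made for exposition, not the official schema released with the test sets:

```xml
<seg id="1">Patients with <term id="2" tgt="bronchite aiguë">acute bronchitis</term> should rest.</seg>
```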


TEST SET SUBMISSION

Each submitted file has to be in a format that standard scoring scripts such as NIST BLEU or TER can process. This format is similar to the one used in the released source test set files. The script wrap-xml.perl makes it easy to convert an output file in one-segment-per-line format into the required SGML file:

Format: wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT

Example: wrap-xml.perl en newstest2012-src.de.sgm Amazon < decoder-output > decoder-output.sgm

Please email your submissions to the following two organizer emails: {antonis,malam21}[at]gmu[dot]edu, using [WMT Terminologies Submission] in the subject.
We will send you test results with all metrics (and current rankings) as soon as possible!

EVALUATION METRICS

Evaluation will be done automatically and it will focus on both translation accuracy and consistency.

Accuracy: We will evaluate the translations with standard MT metrics (BLEU, chrF, BERTScore, COMET).

Consistency: We will also perform terminology-targeted evaluation. Details on the metrics are available in this paper, and the code implementing them is available here.
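As a rough intuition for terminology-targeted evaluation (this is an illustration of ours, not the official metric implementation from the linked paper), the simplest such statistic is exact-match term accuracy: the fraction of expected target terms that appear verbatim in the hypothesis.

```python
def term_exact_match(hypothesis, expected_terms):
    """Fraction of expected target terms found verbatim
    (case-insensitively) in the hypothesis translation."""
    if not expected_terms:
        return 1.0
    lowered = hypothesis.lower()
    hits = sum(1 for t in expected_terms if t.lower() in lowered)
    return hits / len(expected_terms)
```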

As in other shared tasks, we expect the translated submissions to be in recased, detokenized, XML-like format.

ORGANIZERS

Antonis Anastasopoulos, George Mason University
Md Mahfuz ibn Alam, George Mason University
Laurent Besacier, NAVER
James Cross, Facebook
Georgiana Dinu, AWS
Marcello Federico, AWS
Matthias Gallé, NAVER
Philipp Koehn, Facebook / Johns Hopkins University
Ivana Kvapilíková, Charles University
Vassilina Nikoulina, NAVER
Kweon Woo Jung, NAVER

SPONSORS