Language domains that require very careful use of terminology are abundant. The need for adequate translation within such domains is undeniable, as shown, for example, by the WMT shared tasks on biomedical translation.
More interestingly, as the abundance of research on domain adaptation shows, (a) such language domains are not adequately covered by existing data and models, and (b) new (or “surge”) domains keep arising to which models must be adapted, often with significant downstream implications. Consider the new COVID-19 domain and the large efforts devoted to translating critical information on pandemic handling and infection prevention strategies.
In the case of newly developed domains, while parallel data are hard to come by, it is fairly straightforward to create word- or phrase-level terminologies, which can be used to guide professional translators and ensure both accuracy and consistency.
This shared task replicates such a scenario and invites participants to explore methods for incorporating terminologies into either the training or the inference process, in order to improve both the accuracy and consistency of MT systems on a new domain.
Release of training data and terminologies | April 2021 |
Surprise languages announced | June 28, 2021 |
Test set available | |
Submission of translations | |
System descriptions due | August 5, 2021 |
Camera-ready for system descriptions | September 15, 2021 |
Conference in Punta Cana | November 10-11, 2021 |
We clarify that the blind test sets will not include any terminology annotations. It will fall to the participants' systems/pipelines to identify the source terms given the terminologies.
The submission report should highlight the ways in which participants’ methods and data differ from the standard MT approach. It should make clear which tools and which training sets were used.
You may participate in any or all of the language pairs.
File | EN-ZH | EN-FR | EN-RU | EN-KO | CS-DE | Notes |
---|---|---|---|---|---|---|
Europarl v10 | ✓ | ✓ | Now with metadata. Text is unchanged. The cs-de file is from v8. | |||
ParaCrawl | ✓ | ✓ | ✓ | ✓ | ✓ | Updated for 2021. Further details on ParaCrawl, including tmx files, are available at the ParaCrawl website. The zh-en ParaCrawl is a "bonus release". Metadata (tmx files) may be used. |
News Commentary v16 | ✓ | ✓ | ✓ | ✓ | Updated for 2021 | |
✓ | ✓ | Updated for 2021 | ||||
✓ | ✓ | ✓ | Register and download | |||
✓ | Register and download Same as CWMT corpus from last year, new host. | |||||
✓ | ✓ | ✓ | We release the official version, with added language identification (from cld2). | |||
✓ | ✓ | Back-translated news. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018. | ||||
✓ | ✓ | From 2013. Used in WMT 2015 News Translation Task. | ||||
✓ | ||||||
Yandex Corpus | ✓ | |||||
DGT v2019 | ✓ | |||||
CC-Aligned | ✓ | |||||
MultiCC-Aligned | ✓ |
Corpus | ZH | FR | RU | KO | DE | CS | Notes |
---|---|---|---|---|---|---|---|
News crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated Large corpora of crawled news, collected since 2007. Versions up to 2019 are as before. |
News Commentary | ✓ | ✓ | ✓ | ✓ | ✓ | Updated Monolingual text from news-commentary crawl. Superset of parallel version. Use v16. | |
Extended Common Crawl | ✓ | ✓ | ✓ | ✓ | ✓ | Extended Common Crawl extracted from crawls up to April 2020. |
Note: You may not use the TICO-19 dataset or the EMEA corpus, as they will be part of the evaluation suite.
Id | sourceLang | targetLang | sourceString | targetString |
1 | en | fr | 1918 flu | Grippe de 1918 |
2 | en | fr | acute bronchitis | bronchite aiguë |
3 | en | fr | acute respiratory disease | maladie respiratoire aiguë |
4 | en | fr | AIDS | SIDA |
5 | en | fr | airborne droplets | Gouttelettes en suspension dans l'air |
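Because the blind test sets carry no terminology annotations, a system has to locate source terms itself. The following is a minimal sketch of one way to do that with plain string matching. It assumes the terminology has already been loaded into memory as (source, target) pairs; the names `TERMINOLOGY` and `find_source_terms` are hypothetical and not part of any task tooling.

```python
import re

# Hypothetical in-memory terminology: (source term, target term) pairs
# copied from the en-fr sample rows above.
TERMINOLOGY = [
    ("1918 flu", "Grippe de 1918"),
    ("acute bronchitis", "bronchite aiguë"),
    ("acute respiratory disease", "maladie respiratoire aiguë"),
    ("AIDS", "SIDA"),
    ("airborne droplets", "Gouttelettes en suspension dans l'air"),
]


def find_source_terms(sentence, terminology):
    """Return sorted (start, end, source_term, target_term) tuples.

    Longer terms are tried first, and a match is kept only if it does not
    overlap a span that an earlier (longer) term already claimed.
    """
    matches = []
    taken = []  # character spans already claimed by a match
    by_length = sorted(terminology, key=lambda pair: len(pair[0]), reverse=True)
    for src, tgt in by_length:
        # Assumption: all-uppercase terms (acronyms) are matched case-sensitively,
        # everything else case-insensitively.
        flags = 0 if src.isupper() else re.IGNORECASE
        # Require non-word characters (or string boundaries) around the term.
        pattern = re.compile(r"(?<!\w)" + re.escape(src) + r"(?!\w)", flags)
        for m in pattern.finditer(sentence):
            if all(m.end() <= s or m.start() >= e for s, e in taken):
                taken.append((m.start(), m.end()))
                matches.append((m.start(), m.end(), src, tgt))
    return sorted(matches)


if __name__ == "__main__":
    sentence = "Airborne droplets can spread acute respiratory disease."
    for start, end, src, tgt in find_source_terms(sentence, TERMINOLOGY):
        print(f"{start}-{end}: {src} -> {tgt}")
```

Surface matching of this kind misses inflected or reordered variants of a term, so a real pipeline would likely add lemmatization or fuzzy matching on top.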
The script wrap-xml.perl makes it easy to convert an output file in one-segment-per-line format into the required SGML file:
Format: wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT
Example: wrap-xml.perl en termonology-test-2021.src.en.sgm Google < decoder-output > decoder-output.sgm
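Since the wrapper pairs each output line with one source segment, it can help to check that the two counts agree before wrapping. The sketch below is one way to do that; it assumes that segments in the source SGML file are marked with `<seg ...>` tags, and the function names are hypothetical helpers rather than part of wrap-xml.perl.

```python
import re


def count_segments(sgml_text):
    """Count opening <seg ...> tags in the text of an SGML source file.

    Assumption: every source segment is wrapped in a <seg ...> element.
    """
    return len(re.findall(r"<seg[\s>]", sgml_text))


def check_alignment(sgml_text, output_text):
    """Return (n_segments, n_lines, ok) for a source file and a decoder output.

    Blank lines inside the output count as (empty) translations.
    """
    n_segments = count_segments(sgml_text)
    n_lines = len(output_text.splitlines())
    return n_segments, n_lines, n_segments == n_lines


def check_files(src_sgml_path, decoder_output_path):
    """Read both files as UTF-8 and run check_alignment on their contents."""
    with open(src_sgml_path, encoding="utf-8") as f:
        sgml_text = f.read()
    with open(decoder_output_path, encoding="utf-8") as f:
        output_text = f.read()
    return check_alignment(sgml_text, output_text)
```

Calling `check_files` with the same source SGML file and decoder output that are passed to wrap-xml.perl returns the two counts and whether they match.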
[Update July 21: BLIND TEST SETS for all languages are available!] You can download them here!
<tstset trglang="en" setid="newstest2019" srclang="any">
, with trglang set to one of fr, ru, or en.
Important: srclang is always any.
Please email your submissions to the following two organizer emails: {antonis,malam21}[at]gmu[dot]edu, using [WMT Terminologies Submission] in the subject.
We will send you test results with all metrics (and current rankings) as soon as possible!
Accuracy: We will evaluate the translations with standard MT metrics (BLEU, chrF, BERTScore, COMET).
Consistency: We will also perform terminology-targeted evaluation. Details on the metrics are available in this paper, and the code for the metrics is available here.
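To give a feel for what a terminology-targeted score can look like, here is a simplified sketch of an exact-match term accuracy. It is an illustration only and not the official metric, whose definition and code are in the paper and repository referenced above. The function name and the input layout (one list of expected target terms per segment) are assumptions.

```python
def term_exact_match_accuracy(hypotheses, expected_terms):
    """Fraction of expected target terms found verbatim in the hypotheses.

    hypotheses:     list of translated segments (strings)
    expected_terms: list of lists; expected_terms[i] holds the target-side
                    terms that should appear in hypotheses[i]
    Matching is case-insensitive substring search, which is an assumption
    of this sketch. Returns 0.0 when no terms are expected.
    """
    total = 0
    hits = 0
    for hyp, terms in zip(hypotheses, expected_terms):
        hyp_lower = hyp.lower()
        for term in terms:
            total += 1
            if term.lower() in hyp_lower:
                hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    hyps = [
        "Le patient souffre d'une bronchite aiguë.",
        "Le sida se transmet par le sang.",
    ]
    terms = [
        ["bronchite aiguë"],
        ["SIDA", "Gouttelettes en suspension dans l'air"],
    ]
    print(term_exact_match_accuracy(hyps, terms))
```

A score like this rewards only verbatim use of the dictionary form, so correctly inflected variants of a term count as misses.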
As in other shared tasks, we expect the translated submissions to be in recased, detokenized, XML-like format.