With recent improvements in MT quality, we have decided to move away from testing only on the news domain and are shifting the focus of WMT to testing the general capabilities of MT systems. Here are the main changes:
The former News translation task of WMT changes its focus this year to the evaluation of general MT capabilities.
The main difference from past years is that the test sets will contain multiple domains.
For this year, the language pairs, organised by resource level and language similarity, are:
| | High resource | Medium resource | Low resource |
|---|---|---|---|
| Closely related | | uk-cs | |
| Same family | en-de, en-cs, en-ru | fr-de, uk-en | en>hr |
| Distant | en-zh | en-ja | liv-en, sah-ru |
The important dates for the task are:
Milestone | Date
---|---
Release of training data for shared tasks | most data are already released
Test suite source texts must reach us | TBC (June)
Test data released | 21st July
Translation submission deadline | 28th July (AoE)
Translated test suites shipped back to test suite authors | TBC (July)
Abstract system description submission | 4th August
We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage broad participation -- if you feel that your method is interesting but not state-of-the-art, please participate in order to disseminate it and to measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. Translation quality is measured by manual evaluation and various automatic evaluation metrics.
You may participate in any or all of the language pairs. For all language pairs we will test translation in both directions. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set. You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT22.
If you use additional training data (not provided by the WMT22 organisers) or existing translation systems, you must flag that your system uses additional data. We will distinguish system submissions that used the provided training data (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition as well as pretrained language models released before February 2022.
Each participant is required to submit a system description paper, which should highlight the ways in which your methods and data differ from the standard task. You should make clear which tools and which training sets you used.
Each participant also has to submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the approaches/data/tools used, but it can be a full description paper or a draft that is later revised into the final system description paper.
See the Main page for the link to the submission site.
We are interested in the question of whether MT can be improved by using context beyond the sentence, and to what extent state-of-the-art MT systems can produce translations that are correct "in-context". All of our development and test data contains full documents, and all our human evaluation will be in-context; in other words, evaluators will see each sentence together with its surrounding context when evaluating.
Our training data retains context and document boundaries wherever possible; corpora that keep documents intact are noted in the tables below.
The data released for the WMT22 General MT task can be freely used for research purposes; we just ask that you cite the WMT22 shared task overview paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.
We aim to use publicly available sources of data wherever possible. Our main sources of training data are the Europarl corpus, the UN corpus, the news-commentary corpus and the ParaCrawl corpus. We also release a monolingual News Crawl corpus. Other language-specific corpora will be made available.
You may also use monolingual corpora released by the LDC. Note that the released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. The following Moses tools allow processing of the training data into tokenized format:
tokenizer.perl
detokenizer.perl
lowercase.perl
wrap-xml.perl
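For example, a typical preprocessing pipeline with these scripts might look like the following sketch (the Moses checkout location and the train/output file names are placeholders, not part of the task):

# Assumes a Moses checkout, e.g. from https://github.com/moses-smt/mosesdecoder;
# $MOSES and the file names below are placeholders.
MOSES=~/mosesdecoder/scripts
# Tokenize the English side of the training data
perl $MOSES/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
# Optionally lowercase the tokenized text
perl $MOSES/tokenizer/lowercase.perl < train.tok.en > train.tok.lc.en
# Detokenize system output before evaluation/submission
perl $MOSES/tokenizer/detokenizer.perl -l en < output.tok.en > output.detok.en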
To evaluate your system during development, we suggest using previous test sets. For automatic evaluation, we recommend sacreBLEU, which will automatically download previous WMT test sets for you. You may also want to consider the COMET metric, which has been shown to correlate highly with human judgments. We also release other dev and test sets from previous years.
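For instance, scoring detokenized output against last year's test set could look like this (a sketch; the file names are placeholders, and the sacrebleu and unbabel-comet packages are assumed to be installed via pip):

# Fetch source and reference of a previous WMT test set
sacrebleu -t wmt21 -l en-de --echo src > wmt21.en-de.src
sacrebleu -t wmt21 -l en-de --echo ref > wmt21.en-de.ref
# BLEU and chrF on detokenized system output
sacrebleu -t wmt21 -l en-de -m bleu chrf < output.detok.de
# COMET score (comet-score is provided by the unbabel-comet package)
comet-score -s wmt21.en-de.src -t output.detok.de -r wmt21.en-de.ref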
The 2022 test sets will be created from a sample of up to four domains (most likely news, e-commerce, social, and conversational) with an equal number of sentences per domain. The sources of the test sets will be original text, whereas the targets will be human-produced translations. This is in contrast to the test sets up to and including 2018, which were a 50-50 mixture of test sets produced in this way and test sets produced in the reverse direction (i.e. with the original text on the target side).
NEW: You can download all corpora via the command line using mtdata (detailed instructions here), with the exception of the two datasets marked as 'Register and Download' (CzEng 2.0 and CCMT). Usage:
# Install the mtdata downloader and fetch the WMT22 recipe definitions
pip install mtdata==0.3.7
wget https://www.statmt.org/wmt22/mtdata/mtdata.recipes.wmt22-constrained.yml
# Download the constrained-track data for each language pair
for ri in wmt22-{csen,deen,jaen,ruen,zhen,frde,hren,liven,uken,ukcs,sahru}; do
  mtdata get-recipe -ri $ri -o $ri
done
File | CS-EN | DE-EN | JA-EN | RU-EN | ZH-EN | FR-DE | HR-EN | LIV-EN | UK-EN | UK-CS | SAH-RU | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Europarl v10 | ✓ | ✓ | ✓ | |||||||||
ParaCrawl v9 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated: Japanese. The zh-en, ru-en, and fr-de versions are ParaCrawl "bonus releases". The ja-en version of ParaCrawl (JParaCrawl v3) was prepared by NTT. Note that only the ticked language pairs are available to constrained participants, but the metadata (tmx files) may be used. |
✓ | ✓ | ✓ | ✓ | Same as last year. The fr-de version is here | ||||||||
News Commentary v16 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
CzEng 2.0 | ✓ | Register and download CzEng 2.0. The new CzEng includes synthetic data, and includes all cs-en data supplied for the task. See the CzEng README for more details. | ||||||||||
Yandex Corpus | ✓ | 2022-07-08: Also available here | ||||||||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated for 2021 | ||||||
UN Parallel Corpus V1.0 | ✓ | ✓ | Register and download |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | de-en and cs-en contain document information. | ||||||
CCMT Corpus | ✓ | Register and download. Same as CWMT corpus from last year. |
WikiMatrix | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | We release the official version, with added language identification (from cld2). |
OPUS | ✓ | ✓ | All uk-en | All uk-cs | We allow OPUS only for selected languages in the constrained track. For hr-en you may want to use Serbian data. |
Back-translated news | ✓ | ✓ | ✓ | The cs-en data is contained in CzEng. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018. |
✓ | Note: English side is lowercased. | |||||||||||
✓ | ||||||||||||
✓ | From IWSLT 2017 Evaluation Campaign. | |||||||||||
✓ | ✓ | Added 2022-06-08 | ||||||||||
✓ | Added 2022-06-08 |
Corpus | CS | DE | EN | FR | JA | RU | ZH | HR | UK | Notes |
---|---|---|---|---|---|---|---|---|---|---|
News crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated. Large corpora of crawled news, collected since 2007. Versions up to 2019 are as before. For de, cs, and en, versions are available with document boundaries and without sentence-splitting. |
News discussions | ✓ | ✓ | Corpora crawled from comment sections of online newspapers. Available in English and French (no longer updated). | |||||||
Europarl v10 | ✓ | ✓ | ✓ | ✓ | Monolingual version of European parliament crawl. Superset of the parallel version. | |||||
News Commentary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated. Monolingual text from the news-commentary crawl. Superset of the parallel version. Use v16. | |
Common Crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available. | ||
Extended Common Crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Extended Common Crawl extracted from crawls up to April 2020. | |||
UberText Corpus | ✓ | Text crawled from Ukrainian periodicals | ||||||||
Leipzig Corpora | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Leipzig Corpora Collection: From 100 to 200 Languages PDF |
Legal Ukrainian | ✓ | Legal Ukrainian: 69M token corpus in the legal sector; crawled from websites belonging to legislation, government, court, and parliament |
We have released development data for the tasks that are new this year. It was created in the same way as the test data. Note that the dev data contains both forward and reverse translations (clearly marked).
We use an xml format (instead of the previous sgm format) for all dev, test and submission files. It is important to use an xml parser to wrap/unwrap text in order to ensure correct escaping/de-escaping. We will provide tools.
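As an illustration, segments can be extracted from such a file with any standard XML tool; here is a minimal sketch using xmlstarlet, assuming a file named wmttest2022.src.xml whose translatable units are <seg> elements (both the file name and the element structure are assumptions, not the confirmed format):

# Print the text content of every <seg> element, one per line,
# with XML entities correctly de-escaped by the parser
xmlstarlet sel -t -m '//seg' -v . -n wmttest2022.src.xml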
The news-test2011 set has three additional Czech translations that you may want to use. You can download them from Charles University.
Extra references (both translated and paraphrased) for the English to German WMT19 test set have been contributed by Google Research.
Primary systems (for which abstracts have been submitted) will be included in the human evaluation. We will collect subjective judgments about translation quality from annotators, taking the document context into account.
In the unlikely event of an unprecedented number of system submissions that we cannot fully evaluate, we may preselect systems for human evaluation using automatic metrics (in particular, by excluding low-performing unconstrained systems). However, we expect that this will not be necessary and that all primary systems will be evaluated by humans.
For queries, please use the mailing list or contact tomkocmi@microsoft.com.
This task would not have been possible without the sponsorship of monolingual data, test set translation, and evaluation from our partners: Microsoft, Charles University, Toloka, NTT Resonant, LinguaCustodia, Webinterpret, Google, and CyberAgent. We also acknowledge funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 825299 (GoURMET) and 825460 (ELITR). The French-German test sets were funded by the French Ministry of Defense. Additionally, we would like to thank Loïc Barrault, Markus Freitag, Jesús Gonzáles Rubio, Raheel Qader, and many others for their coordination on behalf of our partners.