With recent improvements in MT quality, we have decided to move away from testing only on the news domain and are shifting the focus of WMT to testing the general capabilities of MT systems. Here are the main changes:
The former News translation task of WMT changes its focus this year to the evaluation of general MT capabilities.
The main difference from past years is that the test sets will contain multiple domains.
For this year, the language pairs, organised by resource level and language similarity, are:
| | High resource | Medium resource | Low resource |
|---|---|---|---|
| Closely related | | uk-cs | |
| Same family | en-de, en-cs, en-ru | fr-de, uk-en | en>hr |
| Distant | en-zh | en-ja | liv-en, sah-ru |
The important dates for the task are:
Milestone | Date
---|---
Release of training data for shared tasks | most data are already released
Test suite source texts must reach us | TBC (June)
Test data released | 21st July
Translation submission deadline | 28th July (AoE)
Translated test suites shipped back to test suite authors | TBC (July)
Abstract system description submission | 4th August
We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage broad participation -- if you feel that your method is interesting but not state-of-the-art, please participate in order to disseminate it and to measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. Translation quality is measured by manual evaluation and various automatic evaluation metrics.
You may participate in any or all of the language pairs. For all language pairs we will test translation in both directions. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set. You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT22.
If you use additional training data (not provided by the WMT22 organisers) or existing translation systems, you must flag that your system uses additional data. We will distinguish system submissions that used the provided training data (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition as well as pretrained language models released before February 2022.
Each participant is required to submit a system description paper, which should highlight the ways in which your methods and data differ from the standard task. You should make clear which tools and which training sets you used.
Each participant also has to submit a one-page abstract of the system description one week after the system submission deadline. The abstract should contain, at a minimum, basic information about the system and the approaches/data/tools used, but it can be a full description paper or a draft that is later revised into the final system description paper.
See the Main page for the link to the submission site.
We are interested in the question of whether MT can be improved by using context beyond the sentence, and to what extent state-of-the-art MT systems can produce translations that are correct "in-context". All of our development and test data contains full documents, and all our human evaluation will be in-context; in other words, evaluators will see each sentence together with its surrounding context when evaluating.
Our training data retains context and document boundaries wherever possible; corpora that keep documents intact are noted in the tables below.
The data released for the WMT22 General MT task can be freely used for research purposes; we just ask that you cite the WMT22 shared task overview paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.
We aim to use publicly available sources of data wherever possible. Our main sources of training data are the Europarl corpus, the UN corpus, the news-commentary corpus and the ParaCrawl corpus. We also release a monolingual News Crawl corpus. Other language-specific corpora will be made available.
You may also use monolingual corpora released by the LDC. Note that the released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. The following Moses tools allow processing of the training data into tokenized format:
tokenizer.perl
detokenizer.perl
lowercase.perl
wrap-xml.perl
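For example, a typical preprocessing pipeline with these scripts might look like the following sketch (the Moses checkout location and the train/output file names are placeholders, not part of the task):

# Assumes a Moses checkout, e.g. from https://github.com/moses-smt/mosesdecoder;
# $MOSES and the file names below are placeholders.
MOSES=~/mosesdecoder/scripts
# Tokenize the English side of the training data
perl $MOSES/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
# Optionally lowercase the tokenized text
perl $MOSES/tokenizer/lowercase.perl < train.tok.en > train.tok.lc.en
# Detokenize system output before evaluation/submission
perl $MOSES/tokenizer/detokenizer.perl -l en < output.tok.en > output.detok.en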
To evaluate your system during development, we suggest using previous test sets. For automatic evaluation, we recommend sacreBLEU, which will automatically download previous WMT test sets for you. You may also want to consider the COMET metric, which has been shown to correlate highly with human judgments. We also release other dev and test sets from previous years.
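For instance, scoring detokenized output against last year's test set could look like this (a sketch; the file names are placeholders, and the sacrebleu and unbabel-comet packages are assumed to be installed via pip):

# Fetch source and reference of a previous WMT test set
sacrebleu -t wmt21 -l en-de --echo src > wmt21.en-de.src
sacrebleu -t wmt21 -l en-de --echo ref > wmt21.en-de.ref
# BLEU and chrF on detokenized system output
sacrebleu -t wmt21 -l en-de -m bleu chrf < output.detok.de
# COMET score (comet-score is provided by the unbabel-comet package)
comet-score -s wmt21.en-de.src -t output.detok.de -r wmt21.en-de.ref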
The 2022 test sets will be created from a sample of up to four domains (most likely news, e-commerce, social, and conversational) with an equal number of sentences per domain. The sources of the test sets will be original text, whereas the targets will be human-produced translations. This is in contrast to the test sets up to and including 2018, which were a 50-50 mixture of test sets produced in this way and test sets produced in the reverse direction (i.e. with the original text on the target side).
NEW: You can download all corpora via the command line using mtdata (detailed instructions here), with the exception of the two datasets marked as 'Register and Download' (CzEng 2.0 and CCMT). Usage:
# Install the mtdata downloader and fetch the WMT22 recipe definitions
pip install mtdata==0.3.7
wget https://www.statmt.org/wmt22/mtdata/mtdata.recipes.wmt22-constrained.yml
# Download the constrained-track data for each language pair
for ri in wmt22-{csen,deen,jaen,ruen,zhen,frde,hren,liven,uken,ukcs,sahru}; do
  mtdata get-recipe -ri $ri -o $ri
done
File | CS-EN | DE-EN | JA-EN | RU-EN | ZH-EN | FR-DE | HR-EN | LIV-EN | UK-EN | UK-CS | SAH-RU | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Europarl v10 | ✓ | ✓ | ✓ | |||||||||
ParaCrawl v9 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated: Japanese. The zh-en, ru-en, and fr-de versions are ParaCrawl "bonus releases". The ja-en version of ParaCrawl (JParaCrawl v3) was prepared by NTT. Note that only the ticked language pairs are available to constrained participants, but the metadata (tmx files) may be used. |
✓ | ✓ | ✓ | ✓ | Same as last year. The fr-de version is here | ||||||||
News Commentary v16 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
CzEng 2.0 | ✓ | Register and download CzEng 2.0. The new CzEng includes synthetic data, and includes all cs-en data supplied for the task. See the CzEng README for more details. | ||||||||||
Yandex Corpus | ✓ | 2022-07-08: Also available here | ||||||||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated for 2021 | ||||||
UN Parallel Corpus V1.0 | ✓ | ✓ | Register and download |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | de-en and cs-en contain document information. | ||||||
CCMT Corpus | ✓ | Register and download. Same as CWMT corpus from last year. |
WikiMatrix | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | We release the official version, with added language identification (from cld2). |
OPUS | ✓ | ✓ | All uk-en | All uk-cs | We allow OPUS only for selected languages in the constrained track. For hr-en you may want to use Serbian data. |
Back-translated news | ✓ | ✓ | ✓ | The cs-en data is contained in CzEng. The zh-en and ru-en data was produced for the University of Edinburgh systems in 2017 and 2018. |
✓ | Note: English side is lowercased. | |||||||||||
✓ | ||||||||||||
✓ | From IWSLT 2017 Evaluation Campaign. | |||||||||||
✓ | ✓ | Added 2022-06-08 | ||||||||||
✓ | Added 2022-06-08 |
Corpus | CS | DE | EN | FR | JA | RU | ZH | HR | UK | Notes |
---|---|---|---|---|---|---|---|---|---|---|
News crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated. Large corpora of crawled news, collected since 2007. Versions up to 2019 are as before. For de, cs, and en, versions are available with document boundaries and without sentence-splitting. |
News discussions | ✓ | ✓ | Corpora crawled from comment sections of online newspapers. Available in English and French (no longer updated). | |||||||
Europarl v10 | ✓ | ✓ | ✓ | ✓ | Monolingual version of European parliament crawl. Superset of the parallel version. | |||||
News Commentary | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Updated. Monolingual text from the news-commentary crawl. Superset of the parallel version. Use v16. | |
Common Crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available. | ||
Extended Common Crawl | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Extended Common Crawl extracted from crawls up to April 2020. | |||
UberText Corpus | ✓ | Text crawled from Ukrainian periodicals | ||||||||
Leipzig Corpora | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Leipzig Corpora Collection: From 100 to 200 Languages PDF |
Legal Ukrainian | ✓ | Legal Ukrainian: 69M token corpus in the legal sector; crawled from websites belonging to legislation, government, court, and parliament |
We have released development data for the tasks that are new this year. It was created in the same way as the test data. Note that the dev data contains both forward and reverse translations (clearly marked).
We use an xml format (instead of the previous sgm format) for all dev, test and submission files. It is important to use an xml parser to wrap/unwrap text in order to ensure correct escaping/de-escaping. We will provide tools.
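As an illustration, segments can be extracted from such a file with any standard XML tool; here is a minimal sketch using xmlstarlet, assuming a file named wmttest2022.src.xml whose translatable units are <seg> elements (both the file name and the element structure are assumptions, not the confirmed format):

# Print the text content of every <seg> element, one per line,
# with XML entities correctly de-escaped by the parser
xmlstarlet sel -t -m '//seg' -v . -n wmttest2022.src.xml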
The news-test2011 set has three additional Czech translations that you may want to use. You can download them from Charles University.
Extra references (both translated and paraphrased) for the English to German WMT19 test set have been contributed by Google Research.
Primary systems (for which abstracts have been submitted) will be included in the human evaluation. We will collect subjective judgments about translation quality from annotators, taking the document context into account.
In the unlikely event of an unprecedented number of system submissions that we cannot fully evaluate, we may preselect systems for human evaluation using automatic metrics (in particular, by excluding low-performing unconstrained systems). However, we expect that this will not be necessary and that all primary systems will be evaluated by humans.
For queries, please use the mailing list or contact tomkocmi@microsoft.com.
This task would not have been possible without the sponsorship of monolingual data, test set translation, and evaluation from our partners: Microsoft, Charles University, Toloka, NTT Resonant, LinguaCustodia, Webinterpret, Google, and CyberAgent. We also acknowledge funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 825299 (GoURMET) and 825460 (ELITR). The French-German test sets were funded by the French Ministry of Defense. Additionally, we would like to thank Loïc Barrault, Markus Freitag, Jesús Gonzáles Rubio, Raheel Qader, and many others for their coordination on behalf of our partners.