2020 Fifth Conference on Machine Translation (WMT20)

NEW: Draft proceedings available from links above (updated 2020-12-22)

This conference builds on a series of annual workshops and conferences on statistical machine translation, going back to 2006.

IMPORTANT DATES

Release of training data for shared tasks	February/March, 2020
Evaluation periods for shared tasks	May/June, 2020
Paper submission deadline	August 15, 2020 (Anywhere on Earth)
Paper notification	September 29, 2020
Camera-ready version due	October 10, 2020
Online Conference	November 19-20, 2020

OVERVIEW

This year's conference will feature the following shared tasks:

a news translation task
a biomedical translation task
a similar language translation task
an unsupervised and very low resource translation task
an automatic post-editing task
a metrics task (assess MT quality given reference translation)
a quality estimation task (assess MT quality without access to any reference)
a parallel corpus filtering and alignment task
a lifelong learning in MT task. NEW
a chat translation task. NEW

In addition to the shared tasks, the conference will also feature scientific papers on topics related to MT. Topics of interest include, but are not limited to:

MT models (neural, statistical, etc.)
analysis of neural models for MT
using comparable corpora for MT
selection and preparation of data for MT
semi-supervised and unsupervised learning for MT, transfer learning
multilingual MT
incorporating linguistic information into MT
MT inference
manual and automatic methods for evaluating MT
quality estimation for MT

We encourage authors to evaluate their approaches to the above topics using the common data sets created for the shared tasks.

REGISTRATION AND VISA INFORMATION

These will both be handled by EMNLP 2020.

NEWS TRANSLATION TASK

This shared task will examine translation between the following language pairs:

English to/from Chinese
English to/from Czech (both directions again)
English to/from German
English to/from Inuktitut
English to/from Japanese
English to/from Polish
English to/from Russian
English to/from Tamil

Additional language pairs are still to be confirmed.

The text for all the test sets will be drawn from news articles. Participants may submit translations for any or all of the language directions. In addition to the common test sets the conference organizers will provide optional training resources.

Development sets for the new language pairs, and training data for all pairs, will be made available in January/February 2020. There will be a mixture of high and low resource language pairs, and we also expect to include an unsupervised translation task, as well as to allow multilingual systems.

All submitted systems will undergo human evaluation, and participating teams are expected to contribute to this evaluation.

The news task is supported by Microsoft, NTT and the University of Tokyo, Tilde, National Research Council of Canada, Yandex and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (Gourmet).

BIOMEDICAL TRANSLATION TASK

In this fifth edition of the task, we plan to evaluate systems for the translation of biomedical abstracts for the following language pairs:

English-French and French-English
English-Portuguese and Portuguese-English
English-Spanish and Spanish-English
English-German and German-English
English-Chinese and Chinese-English
English-Italian and Italian-English
English-Russian and Russian-English
English-Basque

We also plan to evaluate the translation of biomedical terminologies for the following language pair:

English-Basque

Parallel corpora will be available for all language pairs, and monolingual corpora for some languages. Evaluation will be carried out both automatically and manually.

SIMILAR LANGUAGE TRANSLATION TASK

The task is organized to evaluate the performance of state-of-the-art MT systems on translating between pairs of languages from the same language family. We will provide participants with training and test data for pairs of similar languages drawn from different language families. Evaluation will be carried out using automatic evaluation metrics and human evaluation.

UNSUPERVISED AND VERY LOW RESOURCE TASK

These two subtasks focus on German to Upper Sorbian (and Upper Sorbian to German) translation. Unsupervised machine translation requires only monolingual data. We also offer a very low resource supervised translation task. Evaluation will be carried out using automatic evaluation metrics.

AUTOMATIC POST-EDITING TASK

This task will focus on the automatic correction of machine translation outputs given a corpus of (source, target, human post-edit) triplets as training material.
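As a rough illustration of the training material described above, the following Python sketch represents one (source, target, human post-edit) triplet and counts how many word-level edits separate the MT output from its post-edit. The class name, field names, example sentences and the use of plain word-level edit distance are all assumptions made for illustration; they are not the task's data format or its official evaluation measure.

```python
from typing import NamedTuple


class ApeTriplet(NamedTuple):
    """One training example; field names are hypothetical."""
    source: str      # sentence in the source language
    target: str      # raw machine translation of the source
    post_edit: str   # human correction of the machine translation


def word_edit_distance(hyp: str, ref: str) -> int:
    """Word-level Levenshtein distance between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))  # distances from the empty hypothesis
    for i, hw in enumerate(h, 1):
        curr = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Invented example triplet, not taken from the task data.
triplet = ApeTriplet(
    source="Das ist ein Test .",
    target="This is test .",
    post_edit="This is a test .",
)
edits_needed = word_edit_distance(triplet.target, triplet.post_edit)
```

An automatic post-editing system would aim to produce output that needs fewer such edits than the raw MT output does.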

METRICS TASK

In this task, participants develop software that can assign a score to the output of MT, based on the reference translation or without access to the reference (the "Quality Estimation as a Metric" track). Metrics are assessed on their correlation with human judgement.
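To make "correlation with human judgement" concrete, here is a minimal Python sketch that computes the Pearson correlation between a metric's scores and human scores for the same set of outputs. The choice of Pearson correlation, the function name and the example numbers are assumptions for illustration; the text above does not say which correlation statistic or level of granularity the task uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Invented scores for four MT outputs: one list from an automatic metric,
# one from human assessment of the same outputs.
metric_scores = [0.31, 0.45, 0.28, 0.52]
human_scores = [62.0, 71.5, 58.0, 80.0]
r = pearson(metric_scores, human_scores)
```

A value of `r` close to 1 would indicate that the metric ranks and spaces outputs much as the human assessors do.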

QUALITY ESTIMATION TASK

This consists of several sub-tasks, all of which are concerned with the idea of assessing the quality of MT output without using a reference, at different levels of granularity and including different language pairs, from low to high resource languages.

LIFELONG LEARNING MT TASK

This task will address the issue of auto-adapting and auto-evaluating MT systems over time, i.e. with a stream of incoming data. It will be based on previous News MT tasks (EN-DE and EN-FR), with an evaluation protocol that takes system performance over time into account.

CHAT TRANSLATION

The chat translation task addresses a different type of text: a dialogue between at least two speakers, in which there is limited opportunity to revise a sentence once it has been uttered. Sentences in this setting tend to be very short and contain many references to previous sentences. Translating them therefore requires document-level information, which makes the task more challenging. The parallel data used for training and evaluating the systems belongs to the customer support domain and will be available for the English-German and English-French language pairs.

PAPER SUBMISSION INFORMATION

Submissions will consist of regular full papers of 6-10 pages, plus additional pages for references. Formatting will follow EMNLP 2020 guidelines. Supplementary material can be added to research papers. In addition, shared task participants will be invited to submit short papers (suggested length: 4-6 pages, plus references) describing their systems or their evaluation metrics. Both submission and review processes will be handled electronically. Note that regular papers must be anonymized, while system descriptions should not be.

Research papers that have been or will be submitted to other meetings or publications must indicate this at submission time, and must be withdrawn from the other venues if accepted and published at WMT 2020. We will not accept for publication papers that overlap significantly in content or results with papers that have been or will be published elsewhere. It is acceptable to submit work that has been made available as a technical report (or similar, e.g. on arXiv) without citing it. This double submission policy only applies to research papers, so system papers can have significant overlap with other published work, if it is relevant to the system description.

We encourage individuals who are submitting research papers to evaluate their approaches using the training resources provided by this conference and past workshops, so that their experiments can be repeated by others using these publicly available corpora.

POSTER FORMAT

We expect that posters will be presented as short talks. Details TBC.

ANNOUNCEMENTS

Subscribe to the announcement list for WMT by entering your e-mail address below. This list will be used to announce when the test sets are released, to indicate any corrections to the training sets, and to amend the deadlines as needed.

Email:

You can read past announcements on the Google Groups page for WMT. These also include an archive of announcements from earlier workshops.

INVITED TALK

The invited talk, entitled "Low-resourcedness Beyond Data", was given by Masakhane. A video of the talk is available on YouTube.

ORGANIZERS

Loïc Barrault (University of Sheffield)
Ondřej Bojar (Charles University in Prague)
Fethi Bougares (University of Le Mans)
Rajen Chatterjee (Apple)
Marta R. Costa-jussà (Universitat Politècnica de Catalunya)
Christian Federmann (MSR)
Mark Fishel (University of Tartu)
Alexander Fraser (LMU Munich)
Yvette Graham (DCU)
Roman Grundkiewicz (MSR)
Paco Guzman (Facebook)
Barry Haddow (University of Edinburgh)
Matthias Huck (LMU Munich)
Antonio Jimeno Yepes (IBM Research Australia)
Tom Kocmi (MSR)
Philipp Koehn (University of Edinburgh / Johns Hopkins University)
André Martins (Unbabel)
Makoto Morishita (NTT)
Christof Monz (University of Amsterdam)
Masaaki Nagata (NTT)
Toshiaki Nakazawa (University of Tokyo)
Matteo Negri (FBK)
Aurélie Névéol (LIMSI, CNRS)
Mariana Neves (German Federal Institute for Risk Assessment)
Martin Popel (Charles University in Prague)
Matt Post (Johns Hopkins University)
Marco Turchi (FBK)
Marcos Zampieri (Rochester Institute of Technology)

ANTI-HARASSMENT POLICY

WMT follows the ACL's anti-harassment policy.

CONTACT

For general questions, comments, etc. please send email to bhaddow@inf.ed.ac.uk.
For task-specific questions, please contact the relevant organisers.