Shared Task: Machine Translation Robustness
This is a translation task of the WMT workshop focusing on robustness of machine translation to noisy input text. The language pairs are:
- French to/from English
- Japanese to/from English
GOALS
Non-standard, noisy text of the kind found on social media and across the internet is ubiquitous. Yet existing machine translation systems struggle to handle the idiosyncrasies of this type of input. The goal of this shared task is to provide a testbed for improving MT models' robustness to orthographic variations, grammatical errors, and other linguistic phenomena common in noisy, user-generated content, via better modelling, adaptation techniques, or leveraging monolingual training data.
Specifically, the shared task aims to bring improvements on the following challenges:
- To improve NMT's robustness to orthographic variations, grammatical errors, informal language and other linguistic phenomena or noise common on social media.
- To explore effective approaches to leverage abundant out-of-domain parallel data.
- To explore novel approaches to leverage abundant monolingual data on the Web (e.g. tweets, Reddit comments, commoncrawl, etc.).
- To thoroughly investigate and understand challenges in translating social media text and identify promising future directions.
IMPORTANT DATES
Release of training/dev data | January 21, 2019
Test data released | April 12, 2019
Translation submission deadline | April 29, 2019 (23:59 UTC-12)
System description paper submission deadline | May 17, 2019
End of evaluation | July 2, 2019
We provide training and dev data from the same domain distribution (Reddit comments) for all language pairs. In addition, we provide pointers to more data sources, focusing on the following two aspects:
Utilizing out-of-domain data
You are highly encouraged to submit systems trained on large amounts of parallel data whose distribution differs from the test domain. We provide pointers to past WMT training corpora.
Utilizing monolingual data
You are highly encouraged to develop novel solutions to utilize monolingual corpora (both in-domain and out-of-domain) to improve translation quality.
You can focus on either or both aspects for your submission.
Constrained submissions are highly encouraged (see definitions below):
- Constrained submission: you use only the datasets provided in the Data section.
- Unconstrained submission: we also welcome unconstrained submissions, i.e. you may use additional datasets. If you do so, please flag all the unconstrained data sources used in your system, and make sure they are publicly available and can be acquired for free. However, we discourage the collection of additional data from the same source (Reddit) so as to prevent any potential overlap with our blind test set.
You are also welcome to use text-normalization tools to preprocess the train/dev/test data. If you do so, please flag the normalization tool you used, and make sure its code is open source and freely available.
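For illustration only, here is a minimal normalization sketch using standard Unix tools; the rules and file names below are hypothetical examples, not a recommended or official pipeline:
# replace URLs with a placeholder token (file names are placeholders)
sed -E 's#https?://[^[:space:]]+#<url>#g' train.src > train.norm.src
# collapse characters repeated three or more times, e.g. "sooooo" -> "soo" (GNU sed)
sed -E 's/(.)\1{2,}/\1\1/g' train.norm.src > train.norm2.src
Whatever tool or rules you use, remember to report them in your system description.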
We also encourage participation purely focused on the text normalization aspect. If you are interested, please contact us and we will provide a pretrained baseline MT system to generate translations.
You may participate in either or both language pairs.
TRAINING DATA
In-domain data:
- You may only use the MTNT dataset as in-domain parallel data, for all language pairs. We also encourage participants to use the monolingual data available in MTNT. However, we discourage the collection of additional data from the same source (Reddit) so as to prevent any potential overlap with our blind test set.
Out-of-domain data:
- English-French: You may use the parallel data made available for the WMT15 news translation task for training. We also encourage the use of monolingual data from that same task.
- Japanese-English: You may use the training data used in the MTNT paper for training, namely the KFTT, TED and JESC corpora.
DEVELOPMENT DATA
In-domain data:
- Both the validation and test sets provided in MTNT can be used as development data for in-domain, noisy text.
Out-of-domain data:
- English-French: You may use all development and test data allowed for the WMT15 shared task.
- English-Japanese: You may use the validation and test data of KFTT, TED and JESC.
NEW: TEST DATA
You can download the blind test sets.
The zip archive contains 4 files:
en-fr.blind.tsv
en-ja.blind.tsv
fr-en.blind.tsv
ja-en.blind.tsv
Each file is tab-separated with 3 columns:
- The first column is a unique number identifying each sentence.
- The second column is a number identifying the comment. Some sentences come from the same Reddit comment, and sentences are ordered as they were found in each comment. You may use this information to leverage context from sentences belonging to the same comment.
- The third and last column contains the source sentence.
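As a minimal sketch, the source text can be extracted from these files with standard Unix tools (the output file names here are arbitrary):
# keep the sentence and comment identifiers for later bookkeeping
cut -f1,2 fr-en.blind.tsv > fr-en.ids
# extract the source sentences (third column) to feed to your MT system
cut -f3 fr-en.blind.tsv > fr-en.src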
NEW: the test sets with reference translations are now available: MTNT2019.tar.gz. The format is the same as the blind test sets, with one additional column containing the reference translation.
Translation output should be submitted recased, detokenized, and in SGML format.
For English-Japanese, your raw text output first needs to be segmented with KyTea (version 0.4.7 recommended):
kytea -model /path/to/kytea/share/kytea/model.bin -out tok YOUR_OUTPUT > YOUR_OUTPUT_TOK
To convert plain text output into the proper format, download the SGML versions of the source files and the script wrap-xml.perl.
With that at hand, you can convert your output with
wrap-xml.perl LANG SOURCE_SGM < YOUR_OUTPUT > YOUR_OUTPUT_SGM
where LANG is the output language (en, ja, or fr), SOURCE_SGM is the SGML source file, and YOUR_OUTPUT and YOUR_OUTPUT_SGM are your system's output in raw text and SGML format, respectively.
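Putting the steps together for English-Japanese (for the other directions, skip the KyTea step); the file names and the KyTea model path below are placeholders:
# segment the raw Japanese output with KyTea, then wrap it in SGML
kytea -model /path/to/kytea/share/kytea/model.bin -out tok en-ja.out > en-ja.out.tok
wrap-xml.perl ja en-ja.blind.sgm < en-ja.out.tok > en-ja.out.sgm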
Please upload this file via the website, following the steps below:
- Go to the website matrix.statmt.org.
- Create an account under the menu item Account -> Create Account.
- Go to Account -> upload/edit content, and follow the link "Submit a system run"
- Select "mtnt2019" as the test set and the language pair you are submitting.
- Select "create new system".
- Click "continue".
- On the next page, upload your file and add some description. Don't forget to indicate whether you are submitting a constrained or unconstrained system.
If you are submitting contrastive runs, please submit your primary system first and mark it clearly as the primary submission.
For system description paper submission, please follow the instructions in PAPER SUBMISSION INFORMATION.
EVALUATION
Evaluation will be done both automatically and by human judgment. Constrained and unconstrained systems will be evaluated and compared separately.
- Manual Scoring: We will collect subjective judgments about translation quality from human annotators.
- We expect the translated submissions to be in recased, detokenized, XML format, just as in most other translation campaigns (NIST, TC-Star).
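If you want to sanity-check a system on the MTNT dev data before submitting, a BLEU score can be computed locally, for example with sacreBLEU; this is only an informal check with placeholder file names, not the official evaluation tooling:
# hypothesis on stdin, reference file as argument
sacrebleu mtnt.dev.ref < mtnt.dev.hyp
# for Japanese hypotheses, sacreBLEU's ja-mecab tokenizer can be used if installed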
ORGANIZERS
- Antonios Anastasopoulos, University of Notre Dame and CMU
- Yonatan Belinkov, Harvard and MIT
- Nadir K. Durrani, QCRI
- Orhan Firat, Google
- Philipp Koehn, JHU
- Paul Michel, Carnegie Mellon University
- Graham Neubig, Carnegie Mellon University
- Xian Li, Facebook
- Juan Miguel Pino, Facebook
- Hassan Sajjad, QCRI
Questions or comments can be posted at wmt-tasks@googlegroups.com.