Shared Task: Machine Translation Robustness
This is a translation task of the WMT workshop focusing on robustness of machine translation to noisy input text. The language pairs are:
- French to/from English
- Japanese to/from English
GOALS
Non-standard, noisy text of the kind found on social media and across the internet is ubiquitous. Yet existing machine translation systems struggle to handle the idiosyncrasies of this type of input. The goal of this shared task is to provide a testbed for improving MT models' robustness to orthographic variations, grammatical errors, and other linguistic phenomena common in noisy, user-generated content, via better modelling, adaptation techniques, or leveraging monolingual training data.
Specifically, the shared task aims to bring improvements on the following challenges:
- To improve NMT's robustness to orthographic variations, grammatical errors, informal language and other linguistic phenomena or noise common on social media.
- To explore effective approaches to leverage abundant out-of-domain parallel data.
- To explore novel approaches to leverage abundant monolingual data on the Web (e.g. tweets, Reddit comments, commoncrawl, etc.).
- To thoroughly investigate and understand challenges in translating social media text and identify promising future directions.
IMPORTANT DATES
Release of training/dev data | January 21, 2019
Test data released | April 12, 2019
Translation submission deadline | April 29, 2019 (23:59 UTC-12)
System description paper submission deadline | May 17, 2019
End of evaluation | July 2, 2019
We provide training and dev data from the same domain distribution (Reddit comments) for all language pairs. In addition, we provide pointers to more data sources, focusing on the following two aspects:
Utilizing out-of-domain data
You are highly encouraged to submit systems trained on large amounts of parallel data whose distribution differs from the test domain. We provide pointers to past WMT training corpora.
Utilizing monolingual data
You are highly encouraged to develop novel solutions to utilize monolingual corpora (both in-domain and out-of-domain) to improve translation quality.
You can focus on either or both aspects for your submission.
Constrained submissions are highly encouraged (see definitions below):
- Constrained submission: you use only the datasets provided in the Data section.
- Unconstrained submission: we also welcome unconstrained submissions, i.e. you may use additional datasets. If you do so, please flag all the unconstrained data sources used in your system, and make sure they are publicly available and can be acquired for free. However, we discourage the collection of additional data from the same source (Reddit) so as to prevent any potential overlap with our blind test set.
You are also welcome to use text-normalization tools to preprocess the train/dev/test data. If you do so, please flag the normalization tool you used, and make sure its code is open source and freely available.
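For illustration only, here is a minimal normalization sketch using standard Unix tools; the rules and file names below are hypothetical examples, not a recommended or official pipeline:
# replace URLs with a placeholder token (file names are placeholders)
sed -E 's#https?://[^[:space:]]+#<url>#g' train.src > train.norm.src
# collapse characters repeated three or more times, e.g. "sooooo" -> "soo" (GNU sed)
sed -E 's/(.)\1{2,}/\1\1/g' train.norm.src > train.norm2.src
Whatever tool or rules you use, remember to report them in your system description.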
We also encourage participation purely focused on the text normalization aspect. If you are interested, please contact us and we will provide a pretrained baseline MT system to generate translations.
You may participate in either or both language pairs.
TRAINING DATA
In-domain data:
- You may only use the MTNT dataset as in-domain parallel data, for all language pairs. We also encourage participants to use the monolingual data available in MTNT. However, we discourage the collection of additional data from the same source (Reddit) so as to prevent any potential overlap with our blind test set.
Out-of-domain data:
- English-French: You may use the parallel data made available for the WMT15 news translation task for training. We also encourage the use of monolingual data from that same task.
- Japanese-English: You may use the training data used in the MTNT paper for training, namely the KFTT, TED and JESC corpora.
DEVELOPMENT DATA
In-domain data:
- Both the validation and test sets provided in MTNT can be used as development data for in-domain, noisy text.
Out-of-domain data:
- English-French: You may use all development and test data allowed for the WMT15 shared task.
- English-Japanese: You may use the validation and test data of KFTT, TED and JESC.
NEW: TEST DATA
You can download the blind test sets.
The zip archive contains 4 files:
en-fr.blind.tsv
en-ja.blind.tsv
fr-en.blind.tsv
ja-en.blind.tsv
Each file is tab-separated with 3 columns:
- The first column is a unique number identifying each sentence.
- The second column is a number identifying the comment. Some sentences come from the same Reddit comment, and sentences are ordered as they were found in each comment. You may use this information to leverage context from sentences belonging to the same comment.
- The third and last column contains the source sentence.
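As a minimal sketch, the source text can be extracted from these files with standard Unix tools (the output file names here are arbitrary):
# keep the sentence and comment identifiers for later bookkeeping
cut -f1,2 fr-en.blind.tsv > fr-en.ids
# extract the source sentences (third column) to feed to your MT system
cut -f3 fr-en.blind.tsv > fr-en.src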
NEW: the test sets with reference translations are now available: MTNT2019.tar.gz. The format is the same as the blind test sets, with one additional column containing the reference translation.
Translation output should be submitted recased, detokenized, and in SGML format.
For English-Japanese, your raw text output first needs to be segmented with KyTea (version 0.4.7 recommended):
kytea -model /path/to/kytea/share/kytea/model.bin -out tok YOUR_OUTPUT > YOUR_OUTPUT_TOK
To convert plain text output into the proper format, download the SGML versions of the source files and the script wrap-xml.perl.
With that at hand, you can convert your output with
wrap-xml.perl LANG SOURCE_SGM < YOUR_OUTPUT > YOUR_OUTPUT_SGM
where LANG is the output language (en, ja, or fr), SOURCE_SGM is the SGML source file, and YOUR_OUTPUT and YOUR_OUTPUT_SGM are your system's output in raw text and SGML format, respectively.
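Putting the steps together for English-Japanese (for the other directions, skip the KyTea step); the file names and the KyTea model path below are placeholders:
# segment the raw Japanese output with KyTea, then wrap it in SGML
kytea -model /path/to/kytea/share/kytea/model.bin -out tok en-ja.out > en-ja.out.tok
wrap-xml.perl ja en-ja.blind.sgm < en-ja.out.tok > en-ja.out.sgm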
Please upload this file via the website, following the steps below:
- Go to the website matrix.statmt.org.
- Create an account under the menu item Account -> Create Account.
- Go to Account -> upload/edit content, and follow the link "Submit a system run"
- Select "mtnt2019" as the test set and the language pair you are submitting.
- Select "create new system".
- Click "continue".
- On the next page, upload your file and add some description. Don't forget to indicate whether you are submitting a constrained or unconstrained system.
If you are submitting contrastive runs, please submit your primary system first and mark it clearly as the primary submission.
For system description paper submission, please follow the instructions in PAPER SUBMISSION INFORMATION.
EVALUATION
Evaluation will be done both automatically and by human judgment. Constrained and unconstrained systems will be evaluated and compared separately.
- Manual Scoring: We will collect subjective judgments about translation quality from human annotators.
- We expect the translated submissions to be in recased, detokenized, XML format, just as in most other translation campaigns (NIST, TC-Star).
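If you want to sanity-check a system on the MTNT dev data before submitting, a BLEU score can be computed locally, for example with sacreBLEU; this is only an informal check with placeholder file names, not the official evaluation tooling:
# hypothesis on stdin, reference file as argument
sacrebleu mtnt.dev.ref < mtnt.dev.hyp
# for Japanese hypotheses, sacreBLEU's ja-mecab tokenizer can be used if installed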
ORGANIZERS
- Antonios Anastasopoulos, University of Notre Dame and CMU
- Yonatan Belinkov, Harvard and MIT
- Nadir K. Durrani, QCRI
- Orhan Firat, Google
- Philipp Koehn, JHU
- Paul Michel, Carnegie Mellon University
- Graham Neubig, Carnegie Mellon University
- Xian Li, Facebook
- Juan Miguel Pino, Facebook
- Hassan Sajjad, QCRI
Questions or comments can be posted at wmt-tasks@googlegroups.com.