Event | Date |
---|---|
System outputs ready to download | May 1, 2016 |
Start of manual evaluation period | May 2, 2016 |
Paper submission deadline | |
Submission deadline for metrics task | May 22, 2016 |
End of manual evaluation | May 22, 2016 |
Notification of acceptance | June 5, 2016 |
Camera-ready deadline | June 22, 2016 |
Conference in Berlin | August 11-12, 2016 |
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the reference human translations. You will return your automatic metric scores for each of the translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your rankings with WMT16 human judgements once the manual evaluation has been completed.
The goals of the shared metrics task are:
The metrics task goes a bit crazy this year. The good news is that if you are not aiming for bleeding-edge performance, you will be affected only minimally:
File formats are not changed (see below); the only difference is that the TEST SET field should include the track name.
If you do want to provide bleeding-edge results, you may want to know a bit more about the composition of the test sets, the sets of systems, the evaluation methods, and the training data we provide.
In short, we are adding "tracks" to cover different text domains, evaluation levels and golden-truth sources (see the table below).
The madness is fully summarized in a live Google sheet.
You can easily identify the track by the test set label (e.g. "RRsegNews+newstest2016") and, based on that, you may want to use a variant of your metric adapted for the track, e.g. tuned on a different development set. Training data are listed below.
Remember to describe the exact setup of your metric used for each of the tracks in your metric paper.
We will provide you with the output of machine translation systems and reference translations (2 references for Finnish, 1 for others) for several language pairs involving English and the following languages: Basque, Bulgarian, Czech, Dutch, Finnish, German, Polish, Portuguese, Romanian, Russian, Spanish, and Turkish. You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
We will measure the goodness of automatic evaluation metrics in the following ways:
System-level correlation: We will use Pearson's correlation coefficient to measure the correlation of the automatic metrics' scores with the official human scores as computed in the translation task. (There will be two variants of official scoring this year, Relative Ranking and Direct Assessment; a sketch of the correlation computations follows below.)
Sentence-level correlation: There will be three types of golden truth in segment/sentence-level evaluation. "Relative ranking" will use the same method as last year: a variation on Kendall's tau that counts pairs of sentences ranked the same way by humans and by your metric (concordant pairs). "Direct assessment" will use the Pearson correlation of your scores with (absolute) human judgements of translation quality. "HUMEseg" will use the Pearson correlation of your segment-level scores with human judgements of semantic nodes, aggregated over each sentence.
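Purely to make these criteria concrete, here is a minimal sketch of the two correlation computations in Python. This is not the official scoring code; in particular, the handling of ties in the Kendall-like statistic is a simplifying assumption here, and the official variants are described in the overview papers.

```python
from math import sqrt

def pearson(metric_scores, human_scores):
    """Pearson correlation between metric scores and human scores."""
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_scores))
    sx = sqrt(sum((x - mx) ** 2 for x in metric_scores))
    sy = sqrt(sum((y - my) ** 2 for y in human_scores))
    return cov / (sx * sy)

def kendall_like(pairs):
    """Kendall's tau-like statistic over human relative-ranking pairs.

    `pairs` contains tuples (score_of_better, score_of_worse): your metric's
    scores for two translations of the same sentence, where humans ranked the
    first strictly better than the second. Metric ties are simply ignored here
    (an assumption; see the overview papers for the official treatment).
    """
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    return (concordant - discordant) / (concordant + discordant)
```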
The following table summarizes the planned evaluation methods and text domains of each evaluation track.
Track | Text Domain | Level | Golden Truth Source |
---|---|---|---|
RRsysNews | news, from WMT16 news task | system-level | relative ranking |
RRsysIT | IT, from WMT16 IT task | system-level | relative ranking |
DAsysNews | news, from WMT16 news task | system-level | direct assessment |
RRsegNews | news, from WMT16 news task | segment-level | relative ranking |
DAsegNews | news, from WMT16 news task | segment-level | direct assessment |
HUMEseg | (consumer) medical, from HimL | segment-level | correctness of translation of all semantic nodes |
If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
WMT16 metrics task test sets are ready. Since we are trying to establish better confidence intervals for system-level evaluation, we have more than 10k system outputs per language pair and test set, so the packages are quite big.
See the Google sheet if you want to take part in only some of the languages or tracks and do not want to download more than needed.
Note that the actual sets of sentences differ across test sets (that's natural) but they also differ across language pairs. So always use the triple {test set name, source language, target language} to identify the test set source, reference and a system output.
There are two references for the English-to-Finnish newstest: newstest2016-enfi-ref.fi and newstest2016-enfi-ref.fiB. You are free to use both; if you use only one, please pick the former.
To take part in a particular language pair (segment-level or system-level), download the package for that language pair (packages are being added gradually):
This loop downloads all the packages (10 GB):

```bash
for lp in cs-en de-en en-bg en-cs en-de en-es en-eu en-fi en-nl en-pl en-pt en-ro en-ru en-tr fi-en ro-en ru-en tr-en; do
  wget http://ufallab.ms.mff.cuni.cz/~bojar/wmt16-metrics-task-data/wmt16-metrics-inputs-for-$lp.tar.bz2
done
```
By downloading the above packages, you have everything for that language pair.
Each package contains one or more test sets (their source, e.g. newstest2016-csen-src.cs, and reference, e.g. newstest2016-csen-ref.en) and system outputs for each of the test sets (e.g. newstest2016.online-B.0.cs-en). Along with the normal MT systems, there are 10k hybrid systems for newstest2016 stored in the directories H0 through H9 and/or 10k hybrid systems for ittest2016 stored in the directories I0 through I9.
The filename of each system output follows the pattern TESTSET.SYSTEMNAME.SYSTEMID.SRC-TGT, including the hybrids, which differ only in their IDs. All filenames across the whole metrics task are unique, but we do not put more than 10k files into a single directory.
For system-level evaluation, you need to score all systems, including the hybrid ones. For segment-level evaluation, you need to score only the normal systems and you can ignore the [HI]* directories.
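For orientation, here is a minimal Python sketch of iterating over the system outputs in an unpacked package and parsing their filenames. The exact directory layout inside the tarball, and the assumption that only sources, references and system outputs are present, are assumptions of this sketch.

```python
import os

def parse_output_name(filename):
    """Split TESTSET.SYSTEMNAME.SYSTEMID.SRC-TGT into testset, system and language pair."""
    testset, rest = filename.split(".", 1)
    system, langpair = rest.rsplit(".", 1)   # system == SYSTEMNAME.SYSTEMID, e.g. online-B.0
    return testset, system, langpair

def iter_system_outputs(package_dir, skip_hybrids=True):
    """Yield (testset, system, langpair, path) for the system outputs in an unpacked package."""
    for root, dirs, files in os.walk(package_dir):
        if skip_hybrids:
            # Hybrid systems live in directories H0..H9 (newstest2016) and I0..I9 (ittest2016);
            # they are only needed for system-level scoring.
            dirs[:] = [d for d in dirs
                       if not (len(d) == 2 and d[0] in "HI" and d[1].isdigit())]
        for name in files:
            if "-src." in name or "-ref." in name:
                continue  # sources and references, not system outputs
            yield (*parse_output_name(name), os.path.join(root, name))
```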
If you want to participate only in segment-level metrics, we do not need scores for the 10k extra systems, so there is a smaller package that includes all languages:
You may want to use some of the following datasets to tune or train your metric.
The system outputs and human judgments from the previous workshops are available for download from the following links:
You can use them to tune your metric's free parameters if it has any. If you want to report results in your paper, you can use this data to compare the performance of your metric against the published results from past years.
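Purely as an illustration of such tuning, here is a minimal Python sketch. Both `my_metric` (a metric with one free parameter) and `judged_pairs` (relative-ranking pairs extracted from the past human judgements) are hypothetical stand-ins; a simple grid search maximizes the Kendall-like statistic.

```python
def tune_free_parameter(candidates, judged_pairs, my_metric):
    """Pick the parameter value that maximizes the Kendall-like score on past data.

    `judged_pairs` should contain (hyp_better, hyp_worse, reference) triples
    extracted from previous years' relative rankings; `my_metric(hyp, ref, alpha)`
    is a hypothetical metric with one free parameter.
    """
    best_alpha, best_tau = None, float("-inf")
    for alpha in candidates:
        concordant = discordant = 0
        for hyp_better, hyp_worse, reference in judged_pairs:
            score_better = my_metric(hyp_better, reference, alpha)
            score_worse = my_metric(hyp_worse, reference, alpha)
            concordant += score_better > score_worse
            discordant += score_better < score_worse
        tau = (concordant - discordant) / (concordant + discordant)
        if tau > best_tau:
            best_alpha, best_tau = alpha, tau
    return best_alpha, best_tau
```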
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
There are no specific training data for RRsysNews vs. RRsysIT. (Or put differently, you have to resort to news-based RR data also for RRsysIT).
For the segment level, we provide a development set of sentences translated from Czech, German, Finnish and Russian into English (500 sentences per source language; translations were sampled at random from the outputs of all systems participating in the WMT15 translation task). The dataset contains:
The package is available here:
There are some direct assessment judgements for system-level English<->Spanish, but this language pair is not among the tested pairs this year. Contact Yvette Graham if you are interested in this dataset.
There are no training data for the HUMEseg track.
To give you at least some background, the golden truth segment-level scores are constructed from manual annotations indicating if each node in the semantic tree of the source sentence was translated correctly. The underlying semantic representation is UCCA.
There is only one system output per segment.
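The exact aggregation over nodes is defined by the HUME annotation scheme and will be described in the overview paper. Purely for intuition, and as an assumption rather than the official definition, one plausible aggregation is the proportion of semantic nodes judged as correctly translated:

```python
def sentence_score_from_nodes(node_correct):
    """Aggregate per-node correctness judgements (booleans) into one sentence-level score.

    The proportion-of-correct-nodes aggregation used here is only an illustrative
    assumption; see the HUME / metrics task overview papers for the real definition.
    """
    return sum(node_correct) / len(node_correct)
```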
Your software should produce scores for the translations either at the system level or at the segment level (or preferably both).
If you have a single setup for all domains and evaluation tracks, simply report the test set name (newstest2016, ittest2016 and himltest) with your scores as usual and as described below. We will evaluate your outputs in all applicable tracks.
If your setups differ based on the provided training data or domain knowledge, please include the evaluation track name as a part of the test set name. Valid track names are: RRsysNews, RRsysIT, DAsysNews, RRsegNews, DAsegNews and HUMEseg; see above.
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:

<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SYSTEM LEVEL SCORE>

Where:
- METRIC NAME is the name of your automatic evaluation metric.
- LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
- TEST SET is the ID of the test set, optionally including the evaluation track (RRsysNews+newstest2016, for example).
- SYSTEM is the ID of the system being scored (given by the corresponding part of the plain-text filename, uedin-syntax.3866 for example).
- SYSTEM LEVEL SCORE is the overall system-level score that your metric is predicting.
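A minimal sketch (not an official tool) of producing such a file in Python; tab-separated fields, the metric name MYMETRIC and the example score are assumptions of this example.

```python
import gzip

def write_sys_scores(metric_name, rows, out_path):
    """Write system-level scores; rows are (lang_pair, test_set, system, score) tuples."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for lang_pair, test_set, system, score in rows:
            out.write(f"{metric_name}\t{lang_pair}\t{test_set}\t{system}\t{score}\n")

# Hypothetical usage with a made-up score:
write_sys_scores("MYMETRIC",
                 [("cs-en", "RRsysNews+newstest2016", "online-B.0", 0.4721)],
                 "MYMETRIC.sys.score.gz")
```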
The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz and formatted in the following way:

<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SEGMENT NUMBER> <SEGMENT SCORE>

Where:
- METRIC NAME is the name of your automatic evaluation metric.
- LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
- TEST SET is the ID of the test set, optionally including the evaluation track (DAsegNews+newstest2015, for example).
- SYSTEM is the ID of the system being scored (given by the corresponding part of the plain-text filename, uedin-syntax.3866 for example).
- SEGMENT NUMBER is the line number, starting from 1, of the plain text input files.
- SEGMENT SCORE is the score your metric predicts for the particular segment.
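The segment-level file can be produced the same way; again a sketch, with tab-separated fields assumed.

```python
import gzip

def write_seg_scores(metric_name, rows, out_path):
    """Write segment-level scores; rows are (lang_pair, test_set, system, segment_number, score)."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for lang_pair, test_set, system, seg_no, score in rows:
            out.write(f"{metric_name}\t{lang_pair}\t{test_set}\t{system}\t{seg_no}\t{score}\n")
```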
Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.
In case the above e-mail address doesn't work for you (Google seems to prevent non-member postings despite our settings), please contact us directly.
Supported by the European Commission under the project (grant number 645452)