EMNLP 2015 Tenth Workshop on Statistical Machine Translation

EMNLP 2015 TENTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Metrics of Machine Translation Quality

17-18 September 2015
Lisbon, Portugal

Metrics Task Important Dates

System outputs ready to download	May 4, 2015
Submission deadline for metrics task	May 25, 2015
Start of manual evaluation period	May 4, 2015
End of manual evaluation	June 1, 2015
Paper submission deadline	June 28, 2015

Metrics Task Overview

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the reference human translations. You will return your automatic metric scores for each of the translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your rankings with WMT15 human judgements once the manual evaluation has been completed.

Goals

The goals of the shared metrics task are:

To achieve the strongest correlation with human judgments of translation quality
To illustrate the suitability of an automatic evaluation metric as a surrogate for human evaluations
To address the problems associated with comparing against a single reference translation
To move automatic evaluation beyond system-level ranking to finer-grained sentence-level ranking

Task Description

We will provide you with the output of machine translation systems for five different language pairs (French-English, Finnish-English, German-English, Czech-English, Russian-English), and will give you the reference translations in each of those languages. You will compute scores for each of the outputs at the system-level and the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.

We will measure the goodness of automatic evaluation metrics in the following ways:

System-level correlation: We will use Pearson's correlation coefficient to measure the correlation of the automatic metrics' scores with the official human scores as computed in the translation task.
Sentence-level correlation: We will use Kendall's tau to measure metrics' correlation with human judgments at the sentence-level. For every pairwise comparison of two systems' output for a single sentence, we will count the automatic metric as being concordant with the human judgment if it orders the systems' output the same way (i.e. the metric assigned a higher score to the higher ranked system). We will exclude pairs that the human annotators ranked as ties.

Download

All WMT15 translation task submissions, including systems from the tuning task are available here:

WMT15 system outputs incl. sources and references (29 MB)

All the metrics submissions together with the scripts to reproduce the results in the metrics task paper are available here:

WMT15 metrics results and scripts (60 MB)

Submission Format

The output of your software should produce scores for the translations either at the system-level or the segment-level (or preferably both).

Output file format for system-level rankings

The output files for system-level rankings should be formatted in the following way:

<METRIC NAME>   <LANG-PAIR>   <TEST SET>   <SYSTEM>   <SYSTEM LEVEL SCORE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair using two letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set (given by the directory structure in the plain text files, newstest2015 for example).
SYSTEM is the ID of system being scored (given by the part of the filename for the plain text file, uedin-syntax.3866 for example).
SYSTEM LEVEL SCORE is the overall system level score that your metric is predicting.

Each field should be delimited by a single tab character.

Output file format for segment-level rankings

The output files for segment-level rankings should be formatted in the following way:

<METRIC NAME>   <LANG-PAIR>   <TEST SET>   <SYSTEM>   <SEGMENT NUMBER>   <SEGMENT SCORE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair using two letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set (given by the directory structure in the plain text files, newstest2015 for example).
SYSTEM is the ID of system being scored (given by the part of the filename for the plain text file, uedin-syntax.3866 for example).
SEGMENT NUMBER is the line number starting from 1 of the plain text input files.
SEGMENT SCORE is the score your metric predicts for the particular segment.

Each field should be delimited by a single tab character.

Past Years' Data

The system outputs and human judgments from the previous workshops are available for download from the following links:

WMT08: http://www.statmt.org/wmt08/results.html
WMT09: http://www.statmt.org/wmt09/results.html
WMT10: http://www.statmt.org/wmt10/results.html
WMT11: http://www.statmt.org/wmt11/results.html
WMT12: http://www.statmt.org/wmt12/results.html
WMT13: http://www.statmt.org/wmt13/results.html
WMT14: http://www.statmt.org/wmt14/results.html

You can use them to tune your metric's free parameters if it has any. If you want to report results in your paper, you can use this data to compare the performance of your metric against the published results from past years.

Last year's data contains all of the system's translations, the source documents and reference human translations and the human judgments of the translation quality.

Other Requirements

If you participate in the evaluation shared task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.

You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.

Metrics Task Organizers

Ondřej Bojar (Charles University in Prague)
Miloš Stanojević (University of Amsterdam, ILLC)
Amir Kamran (University of Amsterdam, ILLC)

Supported by the European Commision
under the
project (grant number 288487)

EMNLP 2015 TENTH WORKSHOPON STATISTICAL MACHINE TRANSLATION