Shared Task: Metrics

26-27 June 2014
Baltimore, USA


This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the reference human translations. You will return your automatic metric scores for each of the translations at the system-level and/or at the sentence-level. Once the manual evaluation has been completed, we will calculate the system-level and sentence-level correlations of your rankings with the WMT14 human judgments.


The goals of the shared metrics task are:

Changes This Year

This year we will use Pearson's correlation coefficient (instead of Spearman's rank correlation coefficient) to evaluate system-level metrics.
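To illustrate the difference between the two coefficients, here is a minimal sketch in plain Python. The function names and the toy scores are illustrative assumptions, not the official evaluation script. Pearson's coefficient is sensitive to how far apart the scores are, while Spearman's only looks at the ordering of the systems.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: one metric score and one human score per system.
metric_scores = [1.0, 2.0, 3.0]
human_scores = [1.0, 4.0, 9.0]
print(pearson(metric_scores, human_scores))
print(spearman(metric_scores, human_scores))
```

In this toy example the metric orders the systems exactly as the humans do, so Spearman's coefficient is 1, while Pearson's is slightly lower because the score differences are not proportional.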

Task Description

We will provide you with the output of machine translation systems for five different language pairs (French-English, Hindi-English, German-English, Czech-English, Russian-English), and will give you the reference translations in each of those languages. You will compute scores for each of the outputs at the system-level and the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and cannot work with translations into languages other than English, you are free to assign scores only for translations into English.

We will measure the goodness of automatic evaluation metrics in the following ways:

Submission Format

Once we receive the system outputs from the translation task we will post all of the system outputs for you to score with your metric. The translations will be distributed as plain text files with one translation per line.
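Because the files hold one translation per line, a system output and its reference can be aligned by line position. The sketch below shows one way to load and pair them. The function names and the UTF-8 encoding are assumptions, not part of the task specification.

```python
def read_segments(path, encoding="utf-8"):
    """Read a plain text file and return its lines in order, one per segment.

    Empty lines are kept, since an empty translation still occupies
    a segment position.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def pair_with_references(hypotheses, references):
    """Pair each system translation with the reference on the same line."""
    if len(hypotheses) != len(references):
        raise ValueError(
            "line count mismatch: %d translations vs %d references"
            % (len(hypotheses), len(references))
        )
    return list(zip(hypotheses, references))
```

Checking the line counts before scoring catches truncated or misaligned files early, before they silently shift every segment-level score.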

Your software should produce scores for the translations either at the system-level or the segment-level (or preferably both).

Output file format for system-level rankings

The output files for system-level rankings should be formatted in the following way:

Where: Each field should be delimited by a single tab character.

Output file format for segment-level rankings

The output files for segment-level rankings should be formatted in the following way:

Where: Each field should be delimited by a single tab character.
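The single-tab delimiter applies to both the system-level and the segment-level files. As a minimal sketch, the helper below joins the fields of one row with tabs and rejects values that would break the format. The function names and the example values are hypothetical placeholders, and the field order shown is not the official one.

```python
import io


def format_row(fields):
    """Join the fields of one output row with single tab characters."""
    cells = [str(field) for field in fields]
    for cell in cells:
        # A tab or line break inside a field would corrupt the column layout.
        if "\t" in cell or "\n" in cell or "\r" in cell:
            raise ValueError("field contains a tab or line break: %r" % cell)
    return "\t".join(cells)


def write_rows(rows, out):
    """Write each row to a file-like object, one row per line."""
    for row in rows:
        out.write(format_row(row) + "\n")


# Placeholder values only; substitute the fields required by the task.
buffer = io.StringIO()
write_rows([("METRIC", "cs-en", "TESTSET", "SYSTEM", 0.5)], buffer)
print(buffer.getvalue(), end="")
```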

Past Years' Data

The system outputs and human judgments from the last three years' workshops are available for download from the following links:

You can use them to tune your metric's free parameters if it has any. If you want to report results in your paper, you can use this data to compare the performance of your metric against the published results from past years.

Last year's data contains all of the systems' translations, the source documents, the reference human translations, and the human judgments of translation quality.

Other Requirements

If you participate in the evaluation shared task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.

You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.


System outputs distributed for metrics task (download tarball here) March 7, 2014
Submission deadline for metrics task (email to 28, 2014
Paper submission deadline April 1, 2014

Supported by the European Commission
under the
project (grant number 288487)