System outputs ready to download | |
Start of manual evaluation period | |
End of manual evaluation (provisional) | |
Paper submission deadline | |
Submission deadline for metrics task | |
Notification of acceptance | June 30th, 2017 |
Camera-ready deadline | July 14th, 2017 |
Conference in Copenhagen | September 7-8, 2017 |
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the human reference translations. You will return your automatic metric scores for translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your scores with WMT17 human judgements once the manual evaluation has been completed.
The goals of the shared metrics task are:
Each submission to this year's metrics task should include:
As trialed in WMT16, the system-level evaluation will optionally include evaluation of metrics with reference to large sets of 10k MT hybrid systems.
We will also include a medical-domain evaluation of metrics at the sentence level, via HUME manual evaluation based on UCCA.
We will provide you with the output of machine translation systems and reference translations for language pairs involving English and the following languages:
You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
We will assess automatic evaluation metrics in the following ways:
System-level correlation: We will use the absolute Pearson correlation coefficient to measure the correlation of the automatic metric scores with the official human scores as computed in the translation task. Direct Assessment will be the official human evaluation; see last year's results for further details (and the sketch below for an illustration of the computation).
Sentence-level correlation: There will be two types of golden truths in segment/sentence-level evaluation. "Direct Assessment" will use the Pearson correlation of your scores with human judgements of translation quality for translations in the news domain. "HUME" will employ the Pearson correlation of your segment-level scores with human judgments of semantic nodes, aggregated over each sentence, for translations in the medical domain.
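To make the correlation-based evaluation above concrete, here is a minimal sketch (in Python, assuming scipy is available) of the system-level computation; the segment-level Direct Assessment correlation is computed analogously over segment scores. All system IDs and numbers are illustrative placeholders, not WMT17 data.

```python
# A minimal sketch of system-level evaluation: the absolute Pearson correlation
# between a metric's system-level scores and human Direct Assessment scores.
# All system IDs and numbers below are illustrative placeholders.
from scipy.stats import pearsonr

human_da      = {"systemA": 0.56, "systemB": 0.52, "systemC": 0.31}
metric_scores = {"systemA": 0.71, "systemB": 0.69, "systemC": 0.55}

systems = sorted(human_da)
r, _ = pearsonr([metric_scores[s] for s in systems],
                [human_da[s] for s in systems])
print("absolute Pearson r = {:.3f}".format(abs(r)))
```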
The following table summarizes the planned evaluation methods and text domains of each evaluation track.
Track | Text Domain | Level | Golden Truth Source |
---|---|---|---|
DAsys | news, from WMT17 news task | system-level | direct assessment |
DAseg | news, from WMT17 news task | segment-level | direct assessment |
HUMEseg | mix of (consumer) medical from HimL and news (WARNING: WMT16 news task) | segment-level | correctness of translation of all semantic nodes |
HUMEsys | mix of (consumer) medical from HimL and news (WARNING: WMT16 news task) | system-level | aggregate correctness of translation of all semantic nodes |
If you participate in the metrics task, we ask you to commit about 8 hours to the manual evaluation. The evaluation will be done with an online tool.
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper; if you do not, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.
WMT17 metrics task test sets are ready. Since we are trying to establish better confidence intervals for system-level evaluation, we have more than 10k system outputs per language pair and test set, so the package is quite big.
We have changed the format of the hybrid systems inputs; see the file wmt17-metrics-task/hybrids/hybrid-instructions in the package for a description. We plan to provide a wrapper for the TXT format to run your metric on the hybrid systems.
If possible, please submit results for all systems, including the hybrids. If you know you won't have the resources to run the hybrids, you can use the smaller package:
Note that the actual sets of sentences naturally differ across test sets, but they also differ across language pairs. So always use the triple {test set name, source language, target language} to identify the test set source, reference and a system output.
There are two references for the English-to-Finnish newstest: newstest2017-enfi-ref.fi and newstestB2017-enfi-ref.fi. You are free to use both; if you use only one, please pick the former.
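To illustrate the triple-based bookkeeping described above, the sketch below indexes reference files by {test set, source language, target language}. The filename pattern follows the newstest2017-enfi-ref.fi example; the glob pattern and directory layout are assumptions for illustration, not the official package structure.

```python
# A sketch of indexing references by the triple {test set, source, target}.
# Assumes reference files are named <testset>-<src><tgt>-ref.<tgt>, as in
# newstest2017-enfi-ref.fi; the directory layout is an assumption.
import glob
import os

references = {}
for path in glob.glob("wmt17-metrics-task/**/*-ref.*", recursive=True):
    name = os.path.basename(path)            # e.g. newstest2017-enfi-ref.fi
    test_set, pair, _ = name.split("-", 2)   # -> "newstest2017", "enfi", "ref.fi"
    src, tgt = pair[:2], pair[2:]
    with open(path, encoding="utf-8") as f:
        references[(test_set, src, tgt)] = [line.rstrip("\n") for line in f]

# Look up the primary English-to-Finnish reference:
finnish_ref = references.get(("newstest2017", "en", "fi"))
```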
You may want to use some of the following data to tune or train your metric.
For system-level, see last year's results
For segment-level, there are two past development sets available covering
Each dataset contains:
For HUMEseg training data, see last year's metrics task results:
wmt16-metrics-results/seg-level-results/hume-files/inputs/HUMEseg/source
wmt16-metrics-results/seg-level-results/hume-files/inputs/HUMEseg/{cs,de,pl,ro}.{hyp,ref}
wmt16-metrics-results/seg-level-results/hume-files/hume-human/hume.himl.en-{cs,de,pl,ro}.csv
Each file lists the segment index (indexed from 1) and the HUME score for that segment. For HUMEseg, the golden-truth segment-level scores are constructed from manual annotations indicating whether each node in the semantic tree of the source sentence was translated correctly. The underlying semantic representation is UCCA.
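As a small illustration, the following sketch loads one of the hume-human CSV files listed above into a mapping from segment index to HUME score; the comma delimiter and the absence of a header row are assumptions based on the description and the .csv extension.

```python
# A sketch of loading HUMEseg golden-truth scores: one (segment index, score)
# pair per line. Comma delimiter and no header row are assumptions.
import csv

def load_hume_scores(path):
    scores = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue
            scores[int(row[0])] = float(row[1])
    return scores

hume_en_cs = load_hume_scores(
    "wmt16-metrics-results/seg-level-results/hume-files/hume-human/hume.himl.en-cs.csv")
```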
In contrast to the previous year, there will be a handful of system outputs per segment (a different set of systems for each language pair).
Although RR (relative ranking) is no longer the manual evaluation employed in the metrics task, human judgments from the previous year's data sets may still prove useful:
You can use data from any past year to tune your metric's free parameters (if it has any) for this year's submission. Additionally, you can use any past data as a test set to compare the performance of your metric against the published results of past years' metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
Your software should produce scores for the translations at either the system level or the segment level (or, preferably, both).
If you have a single setup for all domains and evaluation tracks, simply report the test set name (newstest2017 and himltest) with your scores as usual, as described below. We will evaluate your outputs in all applicable tracks.
If your setups differ based on the provided training data or domain knowledge, please include the evaluation track name as part of the test set name. Valid track names are DAsys, DAseg and HUMEseg; see above.
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SYSTEM LEVEL SCORE> <BEGIN TIMESTAMP> <END TIMESTAMP> <ENSEMBLE> <AVAILABLE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set, optionally including the evaluation track (DAsys+newstest2017, for example).
SYSTEM is the ID of the system being scored (given by part of the filename of the plain text file, uedin-syntax.3866 for example).
SYSTEM LEVEL SCORE is the overall system-level score that your metric predicts.
BEGIN TIMESTAMP is the time at which your metric began processing the raw test data, in Epoch seconds (1493196388 for a start time of 26 Apr 2017 08:46:28 GMT, for example).
END TIMESTAMP is the time at which your metric finished processing the raw test data, in Epoch seconds (1493196486 for an end time of 26 Apr 2017 08:48:06 GMT, for example).
ENSEMBLE indicates whether or not your metric employs any other existing metric (ensemble if yes, non-ensemble if not).
AVAILABLE gives public availability information for your metric (the appropriate URL, https://github.com/jhclark/multeval for example, or no if it is not available yet).
Timestamps should be in Epoch seconds, i.e. as produced by the "date +%s" command (Linux) or equivalent. We will use the two timestamps to work out the rough total duration, in seconds, for your metric to produce scores for the system-level submission. To avoid inconsistencies across submissions, we request timestamps at the very beginning and end of processing the raw data, i.e. the begin timestamp should be taken before any preprocessing such as tokenization (of both MT output and reference translations), so that preprocessing is consistently included in the durations for all metrics.
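The sketch below shows one way to produce a system-level submission file in this format, including the Epoch-second timestamps; the metric name, language pair, scores, tab separator and ENSEMBLE/AVAILABLE values are illustrative assumptions, and the actual scoring is whatever your metric computes.

```python
# A sketch of writing YOURMETRIC.sys.score.gz in the format described above.
# Tab-separated fields and all concrete values are assumptions for illustration.
import gzip
import time

METRIC = "YOURMETRIC"                      # placeholder metric name

begin = int(time.time())                   # Epoch seconds, before any preprocessing
system_scores = {
    # (lang pair, test set, system ID) -> system-level score from your metric
    ("de-en", "newstest2017", "uedin-syntax.3866"): 0.431,
}
end = int(time.time())                     # Epoch seconds, after scoring finishes

with gzip.open(METRIC + ".sys.score.gz", "wt", encoding="utf-8") as out:
    for (pair, test_set, system), score in sorted(system_scores.items()):
        out.write("\t".join(map(str, [METRIC, pair, test_set, system, score,
                                      begin, end, "non-ensemble", "no"])) + "\n")
```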
The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SEGMENT NUMBER> <SEGMENT SCORE> <ENSEMBLE> <AVAILABLE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set, optionally including the evaluation track (DAsegNews+newstest2017, for example).
SYSTEM is the ID of the system being scored (given by part of the filename of the plain text file, uedin-syntax.3866 for example).
SEGMENT NUMBER is the line number, starting from 1, of the plain text input files.
SEGMENT SCORE is the score your metric predicts for the particular segment.
ENSEMBLE indicates whether or not your metric employs any other existing metric (ensemble if yes, non-ensemble if not).
AVAILABLE gives public availability information for your metric (the appropriate URL, https://github.com/jhclark/multeval for example, or no if it is not available yet).
Note: the fields ENSEMBLE and AVAILABLE should be filled with the same value in every line of the submission file for a given metric. This format involves some redundancy but avoids adding extra files to the submission requirements.
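Analogously, a segment-level submission file could be written as in the sketch below; again the tab separator and the placeholder values are assumptions, and ENSEMBLE and AVAILABLE are repeated on every line as noted above.

```python
# A sketch of writing YOURMETRIC.seg.score.gz in the format described above,
# with ENSEMBLE and AVAILABLE repeated on every line. Values are placeholders.
import gzip

METRIC = "YOURMETRIC"
segment_scores = {
    # (lang pair, test set, system ID, segment number) -> segment score
    ("de-en", "newstest2017", "uedin-syntax.3866", 1): 0.62,
    ("de-en", "newstest2017", "uedin-syntax.3866", 2): 0.48,
}

with gzip.open(METRIC + ".seg.score.gz", "wt", encoding="utf-8") as out:
    for (pair, test_set, system, seg), score in sorted(segment_scores.items()):
        out.write("\t".join(map(str, [METRIC, pair, test_set, system, seg,
                                      score, "non-ensemble", "no"])) + "\n")
```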
Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.
If the above e-mail address doesn't work for you (Google seems to prevent postings from non-members despite our settings), please contact us directly.
Supported by the European Commission under the project (grant number 645452)