Event | Date |
---|---|
System outputs ready to download | May 1, 2016 |
Start of manual evaluation period | May 2, 2016 |
Paper submission deadline | |
Submission deadline for metrics task | May 22, 2016 |
End of manual evaluation | May 22, 2016 |
Notification of acceptance | June 5, 2016 |
Camera-ready deadline | June 22, 2016 |
Conference in Berlin | August 11-12, 2016 |
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the reference human translations. You will return your automatic metric scores for each of the translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your rankings with WMT16 human judgements once the manual evaluation has been completed.
The goals of the shared metrics task are:
The metrics task goes a bit crazy this year. The good news is that if you are not aiming for bleeding-edge performance, you will be affected only minimally:
File formats are not changed (see below); the only difference is that the TEST SET field should include the track name.
If you do want to provide bleeding-edge results, you may want to know a bit more about the composition of the test sets, the sets of systems, the evaluation methods, and the training data we provide.
In short, we are adding "tracks" to cover different text domains, evaluation levels and golden-truth sources (see the table below).
The madness is fully summarized in a live Google sheet.
You can easily identify the track by the test set label (e.g. "RRsegNews+newstest2016") and, based on that, you may want to use a variant of your metric adapted for the track, e.g. tuned on a different development set. Training data are listed below.
Remember to describe the exact setup of your metric used for each of the tracks in your metric paper.
We will provide you with the output of machine translation systems and reference translations (2 references for Finnish, 1 for others) for several language pairs involving English and the following languages: Basque, Bulgarian, Czech, Dutch, Finnish, German, Polish, Portuguese, Romanian, Russian, Spanish, and Turkish. You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
We will measure the goodness of automatic evaluation metrics in the following ways:
System-level correlation: We will use Pearson's correlation coefficient to measure the correlation of the automatic metrics' scores with the official human scores as computed in the translation task. (There will be two variants of official scoring this year, Relative Ranking and Direct Assessment; a sketch of the correlation computations follows below.)
Sentence-level correlation: There will be three types of golden truth in segment/sentence-level evaluation. "Relative ranking" will use the same method as last year: a variation on Kendall's tau that counts pairs of sentences ranked the same way by humans and by your metric (concordant pairs). "Direct assessment" will use the Pearson correlation of your scores with (absolute) human judgements of translation quality. "HUMEseg" will use the Pearson correlation of your segment-level scores with human judgements of semantic nodes, aggregated over each sentence.
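Purely to make these criteria concrete, here is a minimal sketch of the two correlation computations in Python. This is not the official scoring code; in particular, the handling of ties in the Kendall-like statistic is a simplifying assumption here, and the official variants are described in the overview papers.

```python
from math import sqrt

def pearson(metric_scores, human_scores):
    """Pearson correlation between metric scores and human scores."""
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_scores))
    sx = sqrt(sum((x - mx) ** 2 for x in metric_scores))
    sy = sqrt(sum((y - my) ** 2 for y in human_scores))
    return cov / (sx * sy)

def kendall_like(pairs):
    """Kendall's tau-like statistic over human relative-ranking pairs.

    `pairs` contains tuples (score_of_better, score_of_worse): your metric's
    scores for two translations of the same sentence, where humans ranked the
    first strictly better than the second. Metric ties are simply ignored here
    (an assumption; see the overview papers for the official treatment).
    """
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    return (concordant - discordant) / (concordant + discordant)
```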
The following table summarizes the planned evaluation methods and text domains of each evaluation track.
Track | Text Domain | Level | Golden Truth Source |
---|---|---|---|
RRsysNews | news, from WMT16 news task | system-level | relative ranking |
RRsysIT | IT, from WMT16 IT task | system-level | relative ranking |
DAsysNews | news, from WMT16 news task | system-level | direct assessment |
RRsegNews | news, from WMT16 news task | segment-level | relative ranking |
DAsegNews | news, from WMT16 news task | segment-level | direct assessment |
HUMEseg | (consumer) medical, from HimL | segment-level | correctness of translation of all semantic nodes |
If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
WMT16 metrics task test sets are ready. Since we are trying to establish better confidence intervals for system-level evaluation, we have more than 10k system outputs per language pair and test set, so the packages are quite big.
See the Google sheet if you want to take part in only some of the languages or tracks and do not want to download more than needed.
Note that the actual sets of sentences differ across test sets (that's natural) but they also differ across language pairs. So always use the triple {test set name, source language, target language} to identify the test set source, reference and a system output.
There are two references for the English-to-Finnish newstest: newstest2016-enfi-ref.fi and newstest2016-enfi-ref.fiB. You are free to use both; if you use only one, please pick the former.
To take part in a particular language pair (segment-level or system-level), download the package for that language pair (packages are being added gradually):
This loop downloads all the packages (10 GB):

```bash
for lp in cs-en de-en en-bg en-cs en-de en-es en-eu en-fi en-nl en-pl en-pt en-ro en-ru en-tr fi-en ro-en ru-en tr-en; do
  wget http://ufallab.ms.mff.cuni.cz/~bojar/wmt16-metrics-task-data/wmt16-metrics-inputs-for-$lp.tar.bz2
done
```
By downloading the above packages, you have everything for that language pair.
Each package contains one or more test sets (their source, e.g. newstest2016-csen-src.cs, and reference, e.g. newstest2016-csen-ref.en) and system outputs for each of the test sets (e.g. newstest2016.online-B.0.cs-en). Along with the normal MT systems, there are 10k hybrid systems for newstest2016 stored in the directories H0 through H9 and/or 10k hybrid systems for ittest2016 stored in the directories I0 through I9.
The filename of each system output follows the pattern TESTSET.SYSTEMNAME.SYSTEMID.SRC-TGT, including the hybrids, which differ only in their IDs. All filenames across the whole metrics task are unique, but we do not put more than 10k files into a single directory.
For system-level evaluation, you need to score all systems, including the hybrid ones. For segment-level evaluation, you need to score only the normal systems and you can ignore the [HI]* directories.
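For orientation, here is a minimal Python sketch of iterating over the system outputs in an unpacked package and parsing their filenames. The exact directory layout inside the tarball, and the assumption that only sources, references and system outputs are present, are assumptions of this sketch.

```python
import os

def parse_output_name(filename):
    """Split TESTSET.SYSTEMNAME.SYSTEMID.SRC-TGT into testset, system and language pair."""
    testset, rest = filename.split(".", 1)
    system, langpair = rest.rsplit(".", 1)   # system == SYSTEMNAME.SYSTEMID, e.g. online-B.0
    return testset, system, langpair

def iter_system_outputs(package_dir, skip_hybrids=True):
    """Yield (testset, system, langpair, path) for the system outputs in an unpacked package."""
    for root, dirs, files in os.walk(package_dir):
        if skip_hybrids:
            # Hybrid systems live in directories H0..H9 (newstest2016) and I0..I9 (ittest2016);
            # they are only needed for system-level scoring.
            dirs[:] = [d for d in dirs
                       if not (len(d) == 2 and d[0] in "HI" and d[1].isdigit())]
        for name in files:
            if "-src." in name or "-ref." in name:
                continue  # sources and references, not system outputs
            yield (*parse_output_name(name), os.path.join(root, name))
```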
If you want to participate only in segment-level metrics, we do not need scores for the 10k extra systems, so there is a smaller package that includes all languages:
You may want to use some of the following datasets to tune or train your metric.
The system outputs and human judgments from the previous workshops are available for download from the following links:
You can use them to tune your metric's free parameters if it has any. If you want to report results in your paper, you can use this data to compare the performance of your metric against the published results from past years.
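Purely as an illustration of such tuning, here is a minimal Python sketch. Both `my_metric` (a metric with one free parameter) and `judged_pairs` (relative-ranking pairs extracted from the past human judgements) are hypothetical stand-ins; a simple grid search maximizes the Kendall-like statistic.

```python
def tune_free_parameter(candidates, judged_pairs, my_metric):
    """Pick the parameter value that maximizes the Kendall-like score on past data.

    `judged_pairs` should contain (hyp_better, hyp_worse, reference) triples
    extracted from previous years' relative rankings; `my_metric(hyp, ref, alpha)`
    is a hypothetical metric with one free parameter.
    """
    best_alpha, best_tau = None, float("-inf")
    for alpha in candidates:
        concordant = discordant = 0
        for hyp_better, hyp_worse, reference in judged_pairs:
            score_better = my_metric(hyp_better, reference, alpha)
            score_worse = my_metric(hyp_worse, reference, alpha)
            concordant += score_better > score_worse
            discordant += score_better < score_worse
        tau = (concordant - discordant) / (concordant + discordant)
        if tau > best_tau:
            best_alpha, best_tau = alpha, tau
    return best_alpha, best_tau
```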
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
There are no specific training data for RRsysNews vs. RRsysIT. (Or put differently, you have to resort to news-based RR data also for RRsysIT).
For the segment level, we provide a development set of sentences translated from Czech, German, Finnish and Russian into English (500 sentences per source language; translations were sampled at random from the outputs of all systems participating in the WMT15 translation task). The dataset contains:
The package is available here:
There are some direct assessment judgements for system-level English<->Spanish, but this language pair is not among the tested pairs this year. Contact Yvette Graham if you are interested in this dataset.
There are no training data for the HUMEseg track.
To give you at least some background, the golden truth segment-level scores are constructed from manual annotations indicating if each node in the semantic tree of the source sentence was translated correctly. The underlying semantic representation is UCCA.
There is only one system output per segment.
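The exact aggregation over nodes is defined by the HUME annotation scheme and will be described in the overview paper. Purely for intuition, and as an assumption rather than the official definition, one plausible aggregation is the proportion of semantic nodes judged as correctly translated:

```python
def sentence_score_from_nodes(node_correct):
    """Aggregate per-node correctness judgements (booleans) into one sentence-level score.

    The proportion-of-correct-nodes aggregation used here is only an illustrative
    assumption; see the HUME / metrics task overview papers for the real definition.
    """
    return sum(node_correct) / len(node_correct)
```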
Your software should produce scores for the translations either at the system level or at the segment level (or preferably both).
If you have a single setup for all domains and evaluation tracks, simply report the test set name (newstest2016, ittest2016 and himltest) with your scores as usual and as described below. We will evaluate your outputs in all applicable tracks.
If your setups differ based on the provided training data or domain knowledge, please include the evaluation track name as a part of the test set name. Valid track names are: RRsysNews, RRsysIT, DAsysNews, RRsegNews, DAsegNews and HUMEseg; see above.
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:

<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SYSTEM LEVEL SCORE>

Where:
- METRIC NAME is the name of your automatic evaluation metric.
- LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
- TEST SET is the ID of the test set, optionally including the evaluation track (RRsysNews+newstest2016, for example).
- SYSTEM is the ID of the system being scored (given by the corresponding part of the plain-text filename, uedin-syntax.3866 for example).
- SYSTEM LEVEL SCORE is the overall system-level score that your metric is predicting.
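A minimal sketch (not an official tool) of producing such a file in Python; tab-separated fields, the metric name MYMETRIC and the example score are assumptions of this example.

```python
import gzip

def write_sys_scores(metric_name, rows, out_path):
    """Write system-level scores; rows are (lang_pair, test_set, system, score) tuples."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for lang_pair, test_set, system, score in rows:
            out.write(f"{metric_name}\t{lang_pair}\t{test_set}\t{system}\t{score}\n")

# Hypothetical usage with a made-up score:
write_sys_scores("MYMETRIC",
                 [("cs-en", "RRsysNews+newstest2016", "online-B.0", 0.4721)],
                 "MYMETRIC.sys.score.gz")
```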
The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz and formatted in the following way:

<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SEGMENT NUMBER> <SEGMENT SCORE>

Where:
- METRIC NAME is the name of your automatic evaluation metric.
- LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
- TEST SET is the ID of the test set, optionally including the evaluation track (DAsegNews+newstest2015, for example).
- SYSTEM is the ID of the system being scored (given by the corresponding part of the plain-text filename, uedin-syntax.3866 for example).
- SEGMENT NUMBER is the line number, starting from 1, of the plain text input files.
- SEGMENT SCORE is the score your metric predicts for the particular segment.
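The segment-level file can be produced the same way; again a sketch, with tab-separated fields assumed.

```python
import gzip

def write_seg_scores(metric_name, rows, out_path):
    """Write segment-level scores; rows are (lang_pair, test_set, system, segment_number, score)."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for lang_pair, test_set, system, seg_no, score in rows:
            out.write(f"{metric_name}\t{lang_pair}\t{test_set}\t{system}\t{seg_no}\t{score}\n")
```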
Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.
In case the above e-mail address doesn't work for you (Google seems to prevent non-member postings despite our settings), please contact us directly.
Supported by the European Commission under the project (grant number 645452)