System outputs ready to download | |
Start of manual evaluation period | |
End of manual evaluation (provisional) | |
Paper submission deadline | |
Submission deadline for metrics task | |
Notification of acceptance | June 30th, 2017 |
Camera-ready deadline | July 14th, 2017 |
Conference in Copenhagen | September 7-8, 2017 |
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the human reference translations. You will return your automatic metric scores for translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your scores with WMT17 human judgements once the manual evaluation has been completed.
The goals of the shared metrics task are:
Each submission to this year's metrics task should include:
As trialed in WMT16, the system-level evaluation will optionally include evaluation of metrics with reference to large sets of 10k MT hybrid systems.
We will also include a medical-domain evaluation of metrics at the sentence level, via HUME manual evaluation based on UCCA.
We will provide you with the output of machine translation systems and reference translations for language pairs involving English and the following languages:
You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
We will assess automatic evaluation metrics in the following ways:
System-level correlation: We will use the absolute Pearson correlation coefficient to measure the correlation of the automatic metric scores with the official human scores as computed in the translation task. Direct Assessment will be the official human evaluation; see last year's results for further details (and the sketch below for an illustration of the computation).
Sentence-level correlation: There will be two types of golden truths in segment/sentence-level evaluation. "Direct Assessment" will use the Pearson correlation of your scores with human judgements of translation quality for translations in the news domain. "HUME" will employ the Pearson correlation of your segment-level scores with human judgments of semantic nodes, aggregated over each sentence, for translations in the medical domain.
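To make the correlation-based evaluation above concrete, here is a minimal sketch (in Python, assuming scipy is available) of the system-level computation; the segment-level Direct Assessment correlation is computed analogously over segment scores. All system IDs and numbers are illustrative placeholders, not WMT17 data.

```python
# A minimal sketch of system-level evaluation: the absolute Pearson correlation
# between a metric's system-level scores and human Direct Assessment scores.
# All system IDs and numbers below are illustrative placeholders.
from scipy.stats import pearsonr

human_da      = {"systemA": 0.56, "systemB": 0.52, "systemC": 0.31}
metric_scores = {"systemA": 0.71, "systemB": 0.69, "systemC": 0.55}

systems = sorted(human_da)
r, _ = pearsonr([metric_scores[s] for s in systems],
                [human_da[s] for s in systems])
print("absolute Pearson r = {:.3f}".format(abs(r)))
```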
The following table summarizes the planned evaluation methods and text domains of each evaluation track.
Track | Text Domain | Level | Golden Truth Source |
---|---|---|---|
DAsys | news, from WMT17 news task | system-level | direct assessment |
DAseg | news, from WMT17 news task | segment-level | direct assessment |
HUMEseg | mix of (consumer) medical from HimL and news (WARNING: WMT16 news task) | segment-level | correctness of translation of all semantic nodes |
HUMEsys | mix of (consumer) medical from HimL and news (WARNING: WMT16 news task) | system-level | aggregate correctness of translation of all semantic nodes |
If you participate in the metrics task, we ask you to commit about 8 hours to the manual evaluation. The evaluation will be done with an online tool.
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper; if you do not, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.
WMT17 metrics task test sets are ready. Since we are trying to establish better confidence intervals for system-level evaluation, we have more than 10k system outputs per language pair and test set, so the package is quite big.
We have changed the format of the hybrid systems inputs; see the file wmt17-metrics-task/hybrids/hybrid-instructions in the package for a description. We plan to provide a wrapper for the TXT format to run your metric on the hybrid systems.
If possible, please submit results for all systems, including the hybrids. If you know you won't have the resources to run the hybrids, you can use the smaller package:
Note that the actual sets of sentences naturally differ across test sets, but they also differ across language pairs. So always use the triple {test set name, source language, target language} to identify the test set source, reference and a system output.
There are two references for the English-to-Finnish newstest: newstest2017-enfi-ref.fi and newstestB2017-enfi-ref.fi. You are free to use both; if you use only one, please pick the former.
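To illustrate the triple-based bookkeeping described above, the sketch below indexes reference files by {test set, source language, target language}. The filename pattern follows the newstest2017-enfi-ref.fi example; the glob pattern and directory layout are assumptions for illustration, not the official package structure.

```python
# A sketch of indexing references by the triple {test set, source, target}.
# Assumes reference files are named <testset>-<src><tgt>-ref.<tgt>, as in
# newstest2017-enfi-ref.fi; the directory layout is an assumption.
import glob
import os

references = {}
for path in glob.glob("wmt17-metrics-task/**/*-ref.*", recursive=True):
    name = os.path.basename(path)            # e.g. newstest2017-enfi-ref.fi
    test_set, pair, _ = name.split("-", 2)   # -> "newstest2017", "enfi", "ref.fi"
    src, tgt = pair[:2], pair[2:]
    with open(path, encoding="utf-8") as f:
        references[(test_set, src, tgt)] = [line.rstrip("\n") for line in f]

# Look up the primary English-to-Finnish reference:
finnish_ref = references.get(("newstest2017", "en", "fi"))
```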
You may want to use some of the following data to tune or train your metric.
For system-level, see last year's results
For segment-level, there are two past development sets available covering
Each dataset contains:
For HUMEseg training data, see last year's metrics task results:
wmt16-metrics-results/seg-level-results/hume-files/inputs/HUMEseg/source
wmt16-metrics-results/seg-level-results/hume-files/inputs/HUMEseg/{cs,de,pl,ro}.{hyp,ref}
wmt16-metrics-results/seg-level-results/hume-files/hume-human/hume.himl.en-{cs,de,pl,ro}.csv
Each file lists the segment index (indexed from 1) and the HUME score for that segment. For HUMEseg, the golden-truth segment-level scores are constructed from manual annotations indicating whether each node in the semantic tree of the source sentence was translated correctly. The underlying semantic representation is UCCA.
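As a small illustration, the following sketch loads one of the hume-human CSV files listed above into a mapping from segment index to HUME score; the comma delimiter and the absence of a header row are assumptions based on the description and the .csv extension.

```python
# A sketch of loading HUMEseg golden-truth scores: one (segment index, score)
# pair per line. Comma delimiter and no header row are assumptions.
import csv

def load_hume_scores(path):
    scores = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue
            scores[int(row[0])] = float(row[1])
    return scores

hume_en_cs = load_hume_scores(
    "wmt16-metrics-results/seg-level-results/hume-files/hume-human/hume.himl.en-cs.csv")
```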
In contrast to the previous year, there will be a handful of system outputs per segment (a different set of systems for each language pair).
Although RR (relative ranking) is no longer the manual evaluation employed in the metrics task, human judgments from the previous year's data sets may still prove useful:
You can use data from any past year to tune your metric's free parameters (if it has any) for this year's submission. Additionally, you can use any past data as a test set to compare the performance of your metric against the published results of past years' metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
Your software should produce scores for the translations at either the system level or the segment level (or, preferably, both).
If you have a single setup for all domains and evaluation tracks, simply report the test set name (newstest2017 and himltest) with your scores as usual, as described below. We will evaluate your outputs in all applicable tracks.
If your setups differ based on the provided training data or domain knowledge, please include the evaluation track name as part of the test set name. Valid track names are DAsys, DAseg and HUMEseg; see above.
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SYSTEM LEVEL SCORE> <BEGIN TIMESTAMP> <END TIMESTAMP> <ENSEMBLE> <AVAILABLE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set, optionally including the evaluation track (DAsys+newstest2017, for example).
SYSTEM is the ID of the system being scored (given by part of the filename of the plain text file, uedin-syntax.3866 for example).
SYSTEM LEVEL SCORE is the overall system-level score that your metric predicts.
BEGIN TIMESTAMP is the time at which your metric began processing the raw test data, in Epoch seconds (1493196388 for a start time of 26 Apr 2017 08:46:28 GMT, for example).
END TIMESTAMP is the time at which your metric finished processing the raw test data, in Epoch seconds (1493196486 for an end time of 26 Apr 2017 08:48:06 GMT, for example).
ENSEMBLE indicates whether or not your metric employs any other existing metric (ensemble if yes, non-ensemble if not).
AVAILABLE gives public availability information for your metric (the appropriate URL, https://github.com/jhclark/multeval for example, or no if it is not available yet).
Timestamps should be in Epoch seconds, i.e. as produced by the "date +%s" command (Linux) or equivalent. We will use the two timestamps to work out the rough total duration, in seconds, for your metric to produce scores for the system-level submission. To avoid inconsistencies across submissions, we request timestamps at the very beginning and end of processing the raw data, i.e. the begin timestamp should be taken before any preprocessing such as tokenization (of both MT output and reference translations), so that preprocessing is consistently included in the durations for all metrics.
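The sketch below shows one way to produce a system-level submission file in this format, including the Epoch-second timestamps; the metric name, language pair, scores, tab separator and ENSEMBLE/AVAILABLE values are illustrative assumptions, and the actual scoring is whatever your metric computes.

```python
# A sketch of writing YOURMETRIC.sys.score.gz in the format described above.
# Tab-separated fields and all concrete values are assumptions for illustration.
import gzip
import time

METRIC = "YOURMETRIC"                      # placeholder metric name

begin = int(time.time())                   # Epoch seconds, before any preprocessing
system_scores = {
    # (lang pair, test set, system ID) -> system-level score from your metric
    ("de-en", "newstest2017", "uedin-syntax.3866"): 0.431,
}
end = int(time.time())                     # Epoch seconds, after scoring finishes

with gzip.open(METRIC + ".sys.score.gz", "wt", encoding="utf-8") as out:
    for (pair, test_set, system), score in sorted(system_scores.items()):
        out.write("\t".join(map(str, [METRIC, pair, test_set, system, score,
                                      begin, end, "non-ensemble", "no"])) + "\n")
```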
The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM> <SEGMENT NUMBER> <SEGMENT SCORE> <ENSEMBLE> <AVAILABLE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set, optionally including the evaluation track (DAsegNews+newstest2017, for example).
SYSTEM is the ID of the system being scored (given by part of the filename of the plain text file, uedin-syntax.3866 for example).
SEGMENT NUMBER is the line number, starting from 1, of the plain text input files.
SEGMENT SCORE is the score your metric predicts for the particular segment.
ENSEMBLE indicates whether or not your metric employs any other existing metric (ensemble if yes, non-ensemble if not).
AVAILABLE gives public availability information for your metric (the appropriate URL, https://github.com/jhclark/multeval for example, or no if it is not available yet).
Note: the fields ENSEMBLE and AVAILABLE should be filled with the same value in every line of the submission file for a given metric. This format involves some redundancy but avoids adding extra files to the submission requirements.
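Analogously, a segment-level submission file could be written as in the sketch below; again the tab separator and the placeholder values are assumptions, and ENSEMBLE and AVAILABLE are repeated on every line as noted above.

```python
# A sketch of writing YOURMETRIC.seg.score.gz in the format described above,
# with ENSEMBLE and AVAILABLE repeated on every line. Values are placeholders.
import gzip

METRIC = "YOURMETRIC"
segment_scores = {
    # (lang pair, test set, system ID, segment number) -> segment score
    ("de-en", "newstest2017", "uedin-syntax.3866", 1): 0.62,
    ("de-en", "newstest2017", "uedin-syntax.3866", 2): 0.48,
}

with gzip.open(METRIC + ".seg.score.gz", "wt", encoding="utf-8") as out:
    for (pair, test_set, system, seg), score in sorted(segment_scores.items()):
        out.write("\t".join(map(str, [METRIC, pair, test_set, system, seg,
                                      score, "non-ensemble", "no"])) + "\n")
```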
Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.
If the above e-mail address doesn't work for you (Google seems to prevent postings from non-members despite our settings), please contact us directly.
Supported by the European Commission under the project (grant number 645452)