Event | Date |
---|---|
System outputs ready to download | |
Start of manual evaluation period | June 11th, 2018 |
End of manual evaluation | July 2nd, 2018 (tentative) |
Submission deadline for metrics task | |
Paper submission deadline | July 27th, 2018 |
Notification of acceptance | August 18th, 2018 |
Camera-ready deadline | August 31st, 2018 |
Conference in Brussels | October 31st – November 1st, 2018 |
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the human reference translations. You will return your automatic metric scores for translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your scores with WMT18 human judgements once the manual evaluation has been completed.
The goals of the shared metrics task are:
Each submission to this year's metrics task should include:
Since 2016, the system-level evaluation has also included an evaluation of metrics against large sets of systems (10k synthetic, "hybrid" MT systems). If your system-level metric is not terribly computationally expensive, please also provide your scores for the 10k hybrid MT systems.
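To make the notion of a "hybrid" system concrete, the sketch below assembles one such synthetic system by picking, for every segment, the output of a randomly chosen real system. The test set name, language pair, and output filename are assumptions for illustration, and this is only meant to convey the idea, not to reproduce the organizers' exact sampling procedure.

    # Sketch: build one synthetic "hybrid" de-en system by selecting, for each
    # segment, the output of a randomly chosen real system.
    # Directory layout as in the evaluation package below; names are assumptions.
    cd wmt18-metrics-task-nohybrids
    paste system-outputs/newstest2018/de-en/* \
      | awk -F'\t' 'BEGIN { srand() } { print $(int(rand() * NF) + 1) }' \
      > hybrid.de-en.txt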
This year, there are no additional domains (such as the medical domain included last year).
We will provide you with the output of machine translation systems and reference translations for language pairs involving English and the following languages:
You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
We will assess automatic evaluation metrics in the following ways:
System-level correlation: We will use the absolute Pearson correlation coefficient to measure the correlation of the automatic metric scores with the official human scores computed in the translation task. Direct Assessment will be the official human evaluation; see last year's results for further details.
Sentence-level correlation: We will use the Pearson correlation of your scores with Direct Assessment human judgements of translation quality. (A fallback to Kendall's tau on "relative ranking" judgements implied from the direct assessments may be necessary for some language pairs, as was done in 2017.)
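For reference, writing m_i for your metric's score and h_i for the human score of item i, with n items in total (our notation, not the task's), the Pearson correlation used in both tracks is

    r = \frac{\sum_{i=1}^{n} (m_i - \bar{m})(h_i - \bar{h})}
             {\sqrt{\sum_{i=1}^{n} (m_i - \bar{m})^2} \, \sqrt{\sum_{i=1}^{n} (h_i - \bar{h})^2}}

where \bar{m} and \bar{h} are the respective means; the system-level evaluation reports its absolute value |r|.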
The following table summarizes the planned evaluation methods and text domains of each evaluation track.
Track | Text Domain | Level | Golden Truth Source |
---|---|---|---|
DAsys | news, from WMT18 news task | system-level | direct assessment |
DAseg | news, from WMT18 news task | segment-level | direct assessment |
If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. You are also invited to submit a paper describing your metric.
To take part in the manual evaluation, please sign yourself up for some of the "accounts" listed in this Google sheet by entering your name into the "Group" column. (We particularly need evaluation out of English.)
Each such account contains 2x100 sentences to evaluate and takes approximately 1 hour to complete. Ideally, you should thus complete 8 accounts.
To access the annotation, use this URL pattern:
Make sure that, after following this URL, the upper right corner of the web page shows the correct USERNAME. Browser cookies tend to keep the previous account logged in; in that case, click the wrong username on that page and select "Sign out" from the drop-down menu. The URL should then log you in with the correct account.
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
The WMT18 metrics task test sets are ready; apologies for the delay.
There are three subsets of outputs that we would like you to evaluate:
The package of inputs for you to evaluate thus comes in three versions:
Here is a bash script that you may want to wrap around your scorer to process everything:
    cd wmt18-metrics-task-nohybrids
    for testset in `ls -d system-outputs/* | cut -d/ -f2`; do
      for lp in `ls -d system-outputs/$testset/* | cut -d/ -f3`; do
        echo "PROCESSING TESTSET $testset, LANGUAGE_PAIR $lp"
        ref=references/$testset-${lp:0:2}${lp:3:5}-ref.${lp:3:5}
        src=sources/$testset-${lp:0:2}${lp:3:5}-src.${lp:0:2}
        echo "  REF: $ref  SRC: $src"
        for hyp in system-outputs/$testset/$lp/*; do
          echo "  EVALUATING $hyp"
          <YOUR EVALUATION TOOL> --reference=$ref --hypothesis=$hyp --source=$src
        done
      done
    done
You may want to use some of the following data to tune or train your metric.
For system-level, see the results from the previous years:
For segment-level, the following datasets are available:
Each dataset contains:
Although relative ranking (RR) is no longer the manual evaluation employed in the metrics task, human judgments from previous years' data sets may still prove useful:
If your metric has any free parameters, you can use any past year's data to tune them for this year's submission. You can also use any past data as a test set to compare your metric's performance against the published results of past years' metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
Your software should produce scores for the translations at either the system level or the segment level (or preferably both).
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz
and formatted in the following way:
    <METRIC NAME>   <LANG-PAIR>   <TEST SET>   <SYSTEM>   <SYSTEM LEVEL SCORE>   <ENSEMBLE>   <AVAILABLE>

Where:

- METRIC NAME is the name of your automatic evaluation metric.
- LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
- TEST SET is the ID of the test set, optionally including the evaluation track (DAsys+newstest2018, for example).
- SYSTEM is the ID of the system being scored (given by the corresponding part of the plain text file's name, uedin-syntax.3866 for example).
- SYSTEM LEVEL SCORE is the overall system-level score that your metric predicts.
- ENSEMBLE indicates whether or not your metric employs any other existing metric (ensemble if yes, non-ensemble if not).
- AVAILABLE gives public availability information for your metric (the appropriate URL, https://github.com/jhclark/multeval for example, or no if it is not publicly available yet).
(This year, we no longer collect the timing information.)
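For illustration only, the sketch below writes two hypothetical system-level lines and compresses the file; the metric name, the second system ID, the scores, and the URL are invented placeholders, and the fields are shown tab-separated here (the exact separator is not specified above).

    # Sketch: write two example system-level lines (invented values) and compress.
    # "MYMETRIC", "another-system.1234", the scores, and the URL are placeholders.
    {
      printf 'MYMETRIC\tde-en\tDAsys+newstest2018\tuedin-syntax.3866\t0.7312\tnon-ensemble\thttps://example.org/mymetric\n'
      printf 'MYMETRIC\tde-en\tDAsys+newstest2018\tanother-system.1234\t0.6548\tnon-ensemble\thttps://example.org/mymetric\n'
    } > MYMETRIC.sys.score
    gzip MYMETRIC.sys.score   # produces the required MYMETRIC.sys.score.gz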
The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz
and formatted in the following way:
    <METRIC NAME>   <LANG-PAIR>   <TEST SET>   <SYSTEM>   <SEGMENT NUMBER>   <SEGMENT SCORE>   <ENSEMBLE>   <AVAILABLE>

Where:

- METRIC NAME is the name of your automatic evaluation metric.
- LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
- TEST SET is the ID of the test set, optionally including the evaluation track (DAsegNews+newstest2018, for example).
- SYSTEM is the ID of the system being scored (given by the corresponding part of the plain text file's name, uedin-syntax.3866 for example).
- SEGMENT NUMBER is the line number, starting from 1, in the plain text input files.
- SEGMENT SCORE is the score your metric predicts for the particular segment.
- ENSEMBLE indicates whether or not your metric employs any other existing metric (ensemble if yes, non-ensemble if not).
- AVAILABLE gives public availability information for your metric (the appropriate URL, https://github.com/jhclark/multeval for example, or no if it is not publicly available yet).
Note: the fields ENSEMBLE and AVAILABLE should be filled with the same value in every line of the submission file for a given metric. This format involves some redundancy but avoids adding extra files to the submission requirements.
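Before e-mailing, a quick check along the following lines can catch malformed lines; this is a sketch, assuming a hypothetical file MYMETRIC.seg.score.gz with tab-separated fields.

    # Count lines that do not have the 8 expected segment-level fields.
    zcat MYMETRIC.seg.score.gz \
      | awk -F'\t' 'NF != 8 { bad++ } END { print bad + 0, "malformed line(s)" }'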
Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.
As a sanity check, please also add yourself to this shared spreadsheet.
In case the above e-mail address doesn't work for you (Google seems to prevent postings from non-members despite our settings), please contact us directly.