System outputs ready to download | |
Submission deadline for metrics task | Aug 6, 2021 |
Paper submission deadline to WMT | Aug 5, 2021 |
WMT Notification of acceptance | Sept 5, 2021 |
WMT Camera-ready deadline | Sept 15, 2021 |
Conference | Nov 10-11, 2021 |
System outputs are now available to download (see below for the link and submission details).
Update, 27 July 12:15 pm UTC: en-de Challenge set source, ref and system outputs updated.
Update, 30 July 8:45 am UTC: additional system outputs added to newstest2021 en-de, en-ru and zh-en.
Please add yourself to this shared spreadsheet as soon as possible if you intend to submit to this year's metrics task.
This shared task will examine automatic evaluation metrics for machine translation. We will provide MT system outputs along with the source texts and human reference translations. We are looking for automatic metric scores for translations at both the system level and the segment level, and we will calculate the system-level and segment-level correlations of your scores with human judgements.
We invite submissions of reference-free metrics in addition to reference-based metrics.
The goals of the shared metrics task are:
Recent work demonstrated that WMT DA correlates poorly with expert-based human evaluation (MQM) for the WMT2020 English-German and Chinese-English submissions. These findings call into question conclusions drawn from the WMT human evaluation for high-quality MT output. Furthermore, the same paper showed that, on both language pairs, automatic metrics based on pre-trained embeddings already agree with the expert-based evaluation better than the WMT human ratings do. As a consequence, we will integrate the following changes into this year's evaluation campaign:
We will provide you with the source sentences, output of machine translation systems and reference translations.
1. Official results: Correlation with MQM scores on in-domain (news) and out-of-domain data at the sentence and system level on the language pairs:
The inputs will include a selection of MT system submissions to the WMT21 news translation task, online systems, human translations and development systems. We will use Pearson correlation for system-level evaluation and Kendall's Tau for segment-level evaluation.
2. Challenge sets: Accuracy on selecting the better translation on the above language pairs, when comparing high quality translations with MT system outputs that are deliberately corrupted in ways that can be challenging for current automatic metrics.
3. Secondary Evaluation: Correlation with official WMT Direct Assessment (DA) scores at the sentence and system level on the language pairs:
The inputs will include all MT system submissions to the WMT21 news translation task, online systems, and human translations if available. We will use Pearson correlation for system-level evaluation and Kendall's Tau-like evaluation on 'relative ranking' judgements implied from DA for segment-level evaluation.
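For intuition, system- and segment-level correlations of this kind can be computed with scipy. The snippet below is only a minimal sketch with made-up numbers, not the official scoring script; note in particular that the segment-level DA evaluation uses a Kendall's Tau-like statistic over relative-ranking pairs rather than plain Kendall's Tau.

```python
# Minimal sketch of the two correlation statistics (not the official scorer).
# Assumes metric scores and human scores are already aligned by system/segment.
from scipy.stats import pearsonr, kendalltau

# System-level: one metric score and one human score per MT system (illustrative values).
metric_sys = [0.42, 0.38, 0.51, 0.47]
human_sys = [68.1, 65.3, 72.0, 70.4]
print("system-level Pearson r:", pearsonr(metric_sys, human_sys)[0])

# Segment-level: one score per segment (plain Kendall's Tau shown here; the official
# DA evaluation uses a Tau-like statistic over relative-ranking pairs instead).
metric_seg = [0.1, 0.5, 0.3, 0.9, 0.7]
human_seg = [10, 60, 35, 95, 80]
print("segment-level Kendall tau:", kendalltau(metric_seg, human_seg)[0])
```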
You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Information on how to submit is available here. Shared task submission description papers are non-archival, and you are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
You may want to use some of the following data to tune or train your metric:
WMT20 en-de, zh-en: https://github.com/google/wmt-mqm-human-evaluation
The MQM dataset contains segment-level scores as well as annotations of error category and severity (a rough segment-scoring sketch is given at the end of this section). There are two different file formats for MQM:
For system-level, see the results from the previous years:
For segment-level, the following datasets are available:
Each DA dataset (WMT15/WMT16) contains:
Although RR is no longer the manual evaluation employed in the metrics task, human judgments from the previous year's data sets may still prove useful:
You can use any past year's data to tune your metric's free parameters (if it has any) for this year's submission. You can also use any past data as a test set to compare your metric's performance against the published results of past metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations and the human judgements of translation quality.
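As a rough illustration of how the MQM annotations mentioned above can be aggregated into segment-level scores, the sketch below applies one common severity weighting (major = 5, minor = 1, minor punctuation = 0.1). The column names, file name and weights are assumptions, so check the repository's documentation before relying on them.

```python
# Hedged sketch: aggregate MQM error annotations into per-segment penalties.
# Column names ("system", "seg_id", "severity", "category"), the file name and
# the severity weights are assumptions about the released TSV, not an official recipe.
import csv
from collections import defaultdict

def severity_weight(severity, category):
    # One common weighting: major=5, minor=1, minor punctuation=0.1, everything else 0.
    severity = severity.lower()
    if severity == "major":
        return 5.0
    if severity == "minor":
        return 0.1 if "punctuation" in category.lower() else 1.0
    return 0.0

penalties = defaultdict(float)  # (system, seg_id) -> accumulated penalty
with open("mqm_newstest2020_ende.tsv", encoding="utf-8") as f:  # hypothetical file name
    for row in csv.DictReader(f, delimiter="\t"):
        penalties[(row["system"], row["seg_id"])] += severity_weight(row["severity"], row["category"])

# Lower accumulated penalty = better translation; negate (and average over raters)
# if your metric treats higher scores as better.
```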
WMT21 metrics task test sets are now available.
Update Aug 30, 2021: (1) added ref-C and ref-D for newstest2021 en-de; (2) added ref-B for TED zh-en.
There are three subsets of outputs that we need you to evaluate:
These test sets are available in plain-text format. We also provide metadata that specifies the document ID for each line of the corresponding source, reference and system-output files.
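Since each file holds one segment per line, a metric implementation typically reads the source, the reference and a system output in parallel. The paths below are hypothetical and only illustrate that layout:

```python
# Illustrative only: iterate over aligned segments of a plain-text test set.
# The paths are hypothetical; use the actual file names from the released package.
def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

sources = read_lines("sources/newstest2021.en-de.src.en")                  # hypothetical path
references = read_lines("references/newstest2021.en-de.ref-A.de")          # hypothetical path
hypotheses = read_lines("system-outputs/newstest2021.en-de.Online-A.de")   # hypothetical path

assert len(sources) == len(references) == len(hypotheses)
for seg_num, (src, ref, hyp) in enumerate(zip(sources, references, hypotheses), start=1):
    pass  # score (src, ref, hyp) with your metric; seg_num matches the SEGMENT NUMBER field below
```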
Your software should produce scores for the translations at either the system level or the segment level (or preferably both). This year we no longer include document-level evaluation.
There are sample metrics (with random scores) available in the validation folder in the shared data folder.
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz
and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <REFERENCE> <SYSTEM-ID> <SYSTEM LEVEL SCORE>
The output files for segment-level scores should be called YOURMETRIC.seg.score.gz
and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <REFERENCE> <SYSTEM-ID> <SEGMENT NUMBER> <SEGMENT SCORE>
Each field should be delimited by a single tab character.
Where:
METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set (newstest2021, florestest2021, tedtalks or challengeset).
REFERENCE is the ID of the reference (ref-A, ref-B, ref-C or ref-D for reference-based metrics, and src for reference-free metrics).
SYSTEM-ID is the ID of the system being scored (given by the part of the filename for the plain text file, uedin-syntax for example).
SEGMENT NUMBER is the line number, starting from 1, of the plain text input files.
SYSTEM SCORE is the score your metric predicts for the particular system.
SEGMENT SCORE is the score your metric predicts for the particular segment.

The newstest2021 test set contains additional independent references for the language pairs cs-en, de-en, en-de, en-ru, en-zh, ru-en and zh-en. We would like scores for every MT system for each set of references. Note that we do not collect metric scores using multiple references this year.
For language pairs with two references, these reference translations are included in the system-outputs folder to be evaluated alongside the MT systems.
Source-based metrics should score all human translations in addition to the MT systems. Here is a toy example for a language pair with N systems and 2 references; the metric.sys.score file would contain:
src-metric langpair newstest2021 src sys1 score
...
src-metric langpair newstest2021 src sysN score
src-metric langpair newstest2021 src ref-A score
src-metric langpair newstest2021 src ref-B score
For language pairs with two references available, you will need to score the MT systems against each set of references individually, and then score each reference against the other. The metric.sys.score file would contain:
#score the MT systems with ref-A
ref-metric en-zh newstest2021 ref-A sys1 score
...
ref-metric en-zh newstest2021 ref-A sysN score
#score the MT systems with ref-B
ref-metric en-zh newstest2021 ref-B sys1 score
...
ref-metric en-zh newstest2021 ref-B sysN score
#score ref-B against ref-A
ref-metric en-zh newstest2021 ref-A ref-B score
#score ref-A against ref-B
ref-metric en-zh newstest2021 ref-B ref-A score
The metric.seg.score file would, likewise, contain the segment scores for systems and human translations included in the metric.sys.score file.
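To make the expected layout concrete, here is a hedged sketch that writes both files in the tab-delimited, gzip-compressed format described above. The metric name, systems and scores are placeholders, not real outputs.

```python
# Sketch: write YOURMETRIC.sys.score.gz and YOURMETRIC.seg.score.gz in the
# tab-delimited format described above. All names and scores are placeholders.
import gzip

metric, langpair, testset, reference = "YOURMETRIC", "en-de", "newstest2021", "ref-A"
sys_scores = {"sys1": 0.41, "sysN": 0.47, "ref-B": 0.83}                 # references scored like systems
seg_scores = {"sys1": [0.2, 0.5], "sysN": [0.3, 0.6], "ref-B": [0.8, 0.9]}

with gzip.open(f"{metric}.sys.score.gz", "wt", encoding="utf-8") as f:
    for system, score in sys_scores.items():
        f.write("\t".join([metric, langpair, testset, reference, system, str(score)]) + "\n")

with gzip.open(f"{metric}.seg.score.gz", "wt", encoding="utf-8") as f:
    for system, scores in seg_scores.items():
        for seg_num, score in enumerate(scores, start=1):  # segment numbers start at 1
            f.write("\t".join([metric, langpair, testset, reference, system,
                               str(seg_num), str(score)]) + "\n")
```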
Before you submit, please run your score files through the validation script, which is now available along with the data in the shared folder.
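The provided validation script remains the authoritative check; as a quick local sanity check beforehand, something like the sketch below can catch the most common formatting mistakes (wrong field count, non-integer segment numbers). It is only a convenience sketch, not a replacement for the official script.

```python
# Quick sanity check (not the official validator): every line of a .seg.score.gz
# file should have exactly seven tab-separated fields, with an integer segment number.
import gzip, sys

def check_seg_file(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 7:
                sys.exit(f"{path}:{n}: expected 7 tab-separated fields, got {len(fields)}")
            if not fields[5].isdigit():
                sys.exit(f"{path}:{n}: segment number '{fields[5]}' is not an integer")
    print(f"{path}: basic format OK")

check_seg_file("YOURMETRIC.seg.score.gz")
```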
Note that the English to German data was updated at around 27 July 12:15 pm UTC, and additional system outputs were added to newstest2021 en-de, en-ru and zh-en on July 30th at 8:45 am UTC.
Please add yourself to this shared spreadsheet as soon as possible so we can keep track of your submissions. Submissions should be sent to wmt.metrics@gmail.com with the subject "WMT Metrics submission". You may submit multiple metrics, but you must indicate the primary metric in the email. If submitting more than one metric, please share a folder with all your metrics, for example on Google Drive or Dropbox.
Before August 6th (AOE), please send us an email with: