System outputs ready to download | |
Submission deadline for metrics task | Aug 6, 2021 |
Paper submission deadline to WMT | Aug 5, 2021 |
WMT Notification of acceptance | Sept 5, 2021 |
WMT Camera-ready deadline | Sept 15, 2021 |
Conference | Nov 10-11, 2021 |
System outputs are now available to download (see below for the link and submission details).
Update, 27 July 12:15 pm UTC: en-de Challenge set source, ref and system outputs updated.
Update, 30 July 8:45 am UTC: additional system outputs added to newstest2021 en-de, en-ru and zh-en.
Please add yourself to this shared spreadsheet as soon as possible if you intend to submit to this year's metrics task.
This shared task will examine automatic evaluation metrics for machine translation. We will provide MT system outputs along with the source texts and human reference translations. We are looking for automatic metric scores for translations at both the system level and the segment level, and we will calculate the system-level and segment-level correlations of your scores with human judgements.
We invite submissions of reference-free metrics in addition to reference-based metrics.
The goals of the shared metrics task are:
Recent work demonstrated that WMT DA correlates poorly with expert-based human evaluation (MQM) for the WMT2020 English-German and Chinese-English submissions. These findings call into question conclusions drawn from the WMT human evaluation for high-quality MT output. Furthermore, the same paper showed that, on both language pairs, automatic metrics based on pre-trained embeddings already agree with the expert-based evaluation better than the WMT human ratings do. As a consequence, we will integrate the following changes into this year's evaluation campaign:
We will provide you with the source sentences, output of machine translation systems and reference translations.
1. Official results: Correlation with MQM scores on in-domain (news) and out-of-domain data at the sentence and system level on the language pairs:
The inputs will include a selection of MT system submissions to the WMT21 news translation task, online systems, human translations and development systems. We will use Pearson correlation for system-level evaluation and Kendall's Tau for segment-level evaluation.
2. Challenge sets: Accuracy on selecting the better translation on the above language pairs, when comparing high quality translations with MT system outputs that are deliberately corrupted in ways that can be challenging for current automatic metrics.
3. Secondary Evaluation: Correlation with official WMT Direct Assessment (DA) scores at the sentence and system level on the language pairs:
The inputs will include all MT system submissions to the WMT21 news translation task, online systems, and human translations if available. We will use Pearson correlation for system-level evaluation and Kendall's Tau-like evaluation on 'relative ranking' judgements implied from DA for segment-level evaluation.
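For intuition, system- and segment-level correlations of this kind can be computed with scipy. The snippet below is only a minimal sketch with made-up numbers, not the official scoring script; note in particular that the segment-level DA evaluation uses a Kendall's Tau-like statistic over relative-ranking pairs rather than plain Kendall's Tau.

```python
# Minimal sketch of the two correlation statistics (not the official scorer).
# Assumes metric scores and human scores are already aligned by system/segment.
from scipy.stats import pearsonr, kendalltau

# System-level: one metric score and one human score per MT system (illustrative values).
metric_sys = [0.42, 0.38, 0.51, 0.47]
human_sys = [68.1, 65.3, 72.0, 70.4]
print("system-level Pearson r:", pearsonr(metric_sys, human_sys)[0])

# Segment-level: one score per segment (plain Kendall's Tau shown here; the official
# DA evaluation uses a Tau-like statistic over relative-ranking pairs instead).
metric_seg = [0.1, 0.5, 0.3, 0.9, 0.7]
human_seg = [10, 60, 35, 95, 80]
print("segment-level Kendall tau:", kendalltau(metric_seg, human_seg)[0])
```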
You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Information on how to submit is available here. Shared task submission description papers are non-archival, and you are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
You may want to use some of the following data to tune or train your metric:
WMT20 en-de, zh-en: https://github.com/google/wmt-mqm-human-evaluation
The MQM dataset contains segment-level scores as well as annotations of error category and severity (a rough segment-scoring sketch is given at the end of this section). There are two different file formats for MQM:
For system-level, see the results from the previous years:
For segment-level, the following datasets are available:
Each DA dataset (WMT15/WMT16) contains:
Although RR is no longer the manual evaluation employed in the metrics task, human judgments from the previous year's data sets may still prove useful:
You can use any past year's data to tune your metric's free parameters (if it has any) for this year's submission. You can also use any past data as a test set to compare your metric's performance against the published results of past metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations and the human judgements of translation quality.
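As a rough illustration of how the MQM annotations mentioned above can be aggregated into segment-level scores, the sketch below applies one common severity weighting (major = 5, minor = 1, minor punctuation = 0.1). The column names, file name and weights are assumptions, so check the repository's documentation before relying on them.

```python
# Hedged sketch: aggregate MQM error annotations into per-segment penalties.
# Column names ("system", "seg_id", "severity", "category"), the file name and
# the severity weights are assumptions about the released TSV, not an official recipe.
import csv
from collections import defaultdict

def severity_weight(severity, category):
    # One common weighting: major=5, minor=1, minor punctuation=0.1, everything else 0.
    severity = severity.lower()
    if severity == "major":
        return 5.0
    if severity == "minor":
        return 0.1 if "punctuation" in category.lower() else 1.0
    return 0.0

penalties = defaultdict(float)  # (system, seg_id) -> accumulated penalty
with open("mqm_newstest2020_ende.tsv", encoding="utf-8") as f:  # hypothetical file name
    for row in csv.DictReader(f, delimiter="\t"):
        penalties[(row["system"], row["seg_id"])] += severity_weight(row["severity"], row["category"])

# Lower accumulated penalty = better translation; negate (and average over raters)
# if your metric treats higher scores as better.
```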
WMT21 metrics task test sets are now available.
Update Aug 30, 2021: (1) added ref-C and ref-D for newstest2021 en-de; (2) added ref-B for TED zh-en.
There are three subsets of outputs that we need you to evaluate:
These test sets are available in plain-text format. We also provide metadata that specifies the document ID for each line of the corresponding source, reference and system-output files.
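Since each file holds one segment per line, a metric implementation typically reads the source, the reference and a system output in parallel. The paths below are hypothetical and only illustrate that layout:

```python
# Illustrative only: iterate over aligned segments of a plain-text test set.
# The paths are hypothetical; use the actual file names from the released package.
def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

sources = read_lines("sources/newstest2021.en-de.src.en")                  # hypothetical path
references = read_lines("references/newstest2021.en-de.ref-A.de")          # hypothetical path
hypotheses = read_lines("system-outputs/newstest2021.en-de.Online-A.de")   # hypothetical path

assert len(sources) == len(references) == len(hypotheses)
for seg_num, (src, ref, hyp) in enumerate(zip(sources, references, hypotheses), start=1):
    pass  # score (src, ref, hyp) with your metric; seg_num matches the SEGMENT NUMBER field below
```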
Your software should produce scores for the translations at either the system level or the segment level (or preferably both). This year we no longer include document-level evaluation.
There are sample metrics (with random scores) available in the validation folder in the shared data folder.
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz
and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <REFERENCE> <SYSTEM-ID> <SYSTEM LEVEL SCORE>
The output files for segment-level scores should be called YOURMETRIC.seg.score.gz
and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <REFERENCE> <SYSTEM-ID> <SEGMENT NUMBER> <SEGMENT SCORE>
Each field should be delimited by a single tab character.
Where:
METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set (newstest2021, florestest2021, tedtalks or challengeset).
REFERENCE is the ID of the reference (ref-A, ref-B, ref-C or ref-D for reference-based metrics, and src for reference-free metrics).
SYSTEM-ID is the ID of the system being scored (given by the part of the filename for the plain text file, uedin-syntax for example).
SEGMENT NUMBER is the line number, starting from 1, of the plain text input files.
SYSTEM SCORE is the score your metric predicts for the particular system.
SEGMENT SCORE is the score your metric predicts for the particular segment.

The newstest2021 test set contains additional independent references for the language pairs cs-en, de-en, en-de, en-ru, en-zh, ru-en and zh-en. We would like scores for every MT system for each set of references. Note that we do not collect metric scores using multiple references this year.
For language pairs with two references, these reference translations are included in the system-outputs folder to be evaluated alongside the MT systems.
Source-based metrics should score all human translations in addition to the MT systems. Here is a toy example for a language pair with N systems and 2 references; the metric.sys.score file would contain:
src-metric langpair newstest2021 src sys1 score
...
src-metric langpair newstest2021 src sysN score
src-metric langpair newstest2021 src ref-A score
src-metric langpair newstest2021 src ref-B score
For language pairs with two references available, you will need to score the MT systems against each set of references individually, and then score each reference against the other. The metric.sys.score file would contain:
#score the MT systems with ref-A
ref-metric en-zh newstest2021 ref-A sys1 score
...
ref-metric en-zh newstest2021 ref-A sysN score
#score the MT systems with ref-B
ref-metric en-zh newstest2021 ref-B sys1 score
...
ref-metric en-zh newstest2021 ref-B sysN score
#score ref-B against ref-A
ref-metric en-zh newstest2021 ref-A ref-B score
#score ref-A against ref-B
ref-metric en-zh newstest2021 ref-B ref-A score
The metric.seg.score file would, likewise, contain the segment scores for systems and human translations included in the metric.sys.score file.
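To make the expected layout concrete, here is a hedged sketch that writes both files in the tab-delimited, gzip-compressed format described above. The metric name, systems and scores are placeholders, not real outputs.

```python
# Sketch: write YOURMETRIC.sys.score.gz and YOURMETRIC.seg.score.gz in the
# tab-delimited format described above. All names and scores are placeholders.
import gzip

metric, langpair, testset, reference = "YOURMETRIC", "en-de", "newstest2021", "ref-A"
sys_scores = {"sys1": 0.41, "sysN": 0.47, "ref-B": 0.83}                 # references scored like systems
seg_scores = {"sys1": [0.2, 0.5], "sysN": [0.3, 0.6], "ref-B": [0.8, 0.9]}

with gzip.open(f"{metric}.sys.score.gz", "wt", encoding="utf-8") as f:
    for system, score in sys_scores.items():
        f.write("\t".join([metric, langpair, testset, reference, system, str(score)]) + "\n")

with gzip.open(f"{metric}.seg.score.gz", "wt", encoding="utf-8") as f:
    for system, scores in seg_scores.items():
        for seg_num, score in enumerate(scores, start=1):  # segment numbers start at 1
            f.write("\t".join([metric, langpair, testset, reference, system,
                               str(seg_num), str(score)]) + "\n")
```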
Before you submit, please run your score files through the validation script, which is now available along with the data in the shared folder.
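The provided validation script remains the authoritative check; as a quick local sanity check beforehand, something like the sketch below can catch the most common formatting mistakes (wrong field count, non-integer segment numbers). It is only a convenience sketch, not a replacement for the official script.

```python
# Quick sanity check (not the official validator): every line of a .seg.score.gz
# file should have exactly seven tab-separated fields, with an integer segment number.
import gzip, sys

def check_seg_file(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 7:
                sys.exit(f"{path}:{n}: expected 7 tab-separated fields, got {len(fields)}")
            if not fields[5].isdigit():
                sys.exit(f"{path}:{n}: segment number '{fields[5]}' is not an integer")
    print(f"{path}: basic format OK")

check_seg_file("YOURMETRIC.seg.score.gz")
```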
Note that the English to German data was updated at around 27 July 12:15 pm UTC, and additional system outputs were added to newstest2021 en-de, en-ru and zh-en on July 30th at 8:45 am UTC.
Please add yourself to this shared spreadsheet as soon as possible so we can keep track of your submissions. Submissions should be sent to wmt.metrics@gmail.com with the subject "WMT Metrics submission". You may submit multiple metrics, but you must indicate the primary metric in the email. If submitting more than one metric, please share a folder with all your metrics, for example on Google Drive or Dropbox.
Before August 6th (AOE), please send us an email with: