Test data for the WMT2022 metrics shared task.
mt-metrics-eval: the tool for calculating correlation numbers (the sacreBLEU for metric developers). You can also use it to dump the most recent test sets.
Event | Date |
---|---|
System outputs ready to download | 16th August, 2022 |
Submission deadline for metrics task | 23rd August, 2022 |
Paper submission deadline to WMT | 7th September, 2022 |
WMT Notification of acceptance | 9th October, 2022 |
WMT Camera-ready deadline | 16th October, 2022 |
Conference | 7th - 8th December, 2022 |
The goal of the shared metrics task is to evaluate automatic metrics of machine translation quality by how well their scores correlate with human judgments.
We will provide you with the source sentences, the output of machine translation systems, and reference translations.
You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Shared task submission description papers are non-archival, and you are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
Since data from previous WMT editions can be difficult to navigate, we are adding a table with links to download data from previous years. You can find the new links in the "New: Download links" section.
The WMT Metrics shared task has taken place yearly since 2008. You may want to use data from previous editions to tune/train your metric. The following table provides links to the descriptions, the raw data and the findings papers of the previous editions:
year | MQM | DA system level | DA segment level | relative ranking | paper | .bib |
---|---|---|---|---|---|---|
2021 | link | link | link | | link | link |
2020 | link | link | link | | link | link |
2019 | | link | link | | link | link |
2018 | | link | link | | link | link |
2017 | | link | link | | link | link |
2016 | | link | link | | link | link |
2015 | | | | link | link | link |
2014 | | | | link | link | link |
2013 | | | | link | link | link |
2012 | | | | link | link | link |
2011 | | | | link | link | link |
2010 | | | | link | link | link |
2009 | | | | link | link | link |
2008 | | | | link | link | link |
You can use any past year's data to tune your metric's free parameters (if it has any) for this year's submission. Additionally, you can use any past data as a test set to compare the performance of your metric against published results from past years' metrics task participants.
Also, for measuring metric quality, especially for new metrics, we encourage you to use the mt-metrics-eval repository developed by George Foster.
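For intuition, the sketch below shows the kind of system-level correlation computation that mt-metrics-eval automates; refer to the repository for its actual CLI and Python API. The file names and the use of scipy instead of the repository's own code are illustrative assumptions.

```python
# Minimal sketch: system-level Pearson and Kendall correlation between
# metric scores and human scores, the core quantities reported in the
# metrics task. File names and formats here are illustrative assumptions.
from scipy.stats import pearsonr, kendalltau

def read_scores(path):
    """Read 'SYSTEM-ID<tab>SCORE' lines into a dict."""
    scores = {}
    with open(path) as f:
        for line in f:
            system, score = line.rstrip("\n").split("\t")
            scores[system] = float(score)
    return scores

metric = read_scores("my_metric.sys.tsv")  # hypothetical file
human = read_scores("human.sys.tsv")       # hypothetical file

# Align on systems scored by both the metric and the human evaluation.
systems = sorted(set(metric) & set(human))
x = [metric[s] for s in systems]
y = [human[s] for s in systems]

r, _ = pearsonr(x, y)
tau, _ = kendalltau(x, y)
print("Pearson r:", r)
print("Kendall tau:", tau)
```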
year | DA | relative ranks | paper |
---|---|---|---|
2017 | link | link | Results of the WMT17 Metrics Shared Task |
2018 | link | link | Results of the WMT18 Metrics Shared Task |
2019 | link | link | Results of the WMT19 Metrics Shared Task |
2020 | link | link | Results of the WMT20 Metrics Shared Task |
Note: We are not providing links to the Direct Assessments from 2021 because we found bugs in the scores. We advise participants to avoid using that data. For 2021 you can rely on the MQM annotations below.
year | LP | testset | paper |
---|---|---|---|
2020 | en-de (link) | Newstest2020 | A Large-Scale Study of Human Evaluation for Machine Translation |
2020 | zh-en (link) | Newstest2020 | A Large-Scale Study of Human Evaluation for Machine Translation |
2021 | en-ru (link) | Newstest2021 | Results of the WMT21 Metrics Shared Task |
2021 | en-de (link) | Newstest2021 | Results of the WMT21 Metrics Shared Task |
2021 | zh-en (link) | Newstest2021 | Results of the WMT21 Metrics Shared Task |
2021 | en-ru (link) | TED Talks | Results of the WMT21 Metrics Shared Task |
2021 | en-de (link) | TED Talks | Results of the WMT21 Metrics Shared Task |
2021 | zh-en (link) | TED Talks | Results of the WMT21 Metrics Shared Task |
Note: MQM data for en-de and zh-en was mostly annotated by Google; its scores range from -25 to 0, where 0 is a perfect translation and -25 is the worst possible score. The en-ru data, on the other hand, was annotated by Unbabel; its scores range from -inf to 100, where 100 is a perfect translation and anything below 0 is a bad translation. You can find the original data here, with more information about raters, etc.
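Because the two MQM annotation schemes use different scales, you may want to put them on a comparable footing before mixing them in training or evaluation. The snippet below is a minimal sketch of one common choice, per-dataset z-normalization; the example scores and data layout are assumptions, not part of the official release.

```python
# Minimal sketch: per-annotation-scheme z-normalization of MQM scores, so that
# Google-style (-25..0) and Unbabel-style (-inf..100) scores become roughly
# comparable. The score lists below are placeholders.
import statistics

def z_normalize(scores):
    """Return z-scored copies of a list of raw MQM scores."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return [(s - mean) / stdev for s in scores]

google_mqm = [-1.0, 0.0, -5.0, -0.33]     # hypothetical en-de scores in [-25, 0]
unbabel_mqm = [98.0, 100.0, 72.5, -10.0]  # hypothetical en-ru scores in (-inf, 100]

print(z_normalize(google_mqm))
print(z_normalize(unbabel_mqm))
```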
You can download the system outputs from here.
Your software should produce scores for the translations either at the system level or the segment level (or preferably both).
Along with the data, we release two Python scripts to help you score it. The scripts should be easy to modify to run your own metrics, and we advise you to use them.
We also provide four examples of scored data using BLEU, chrF, BLEURT, and COMET-QE (for QE as a metric), available here.
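As an illustration of how baseline scores like these can be produced, the snippet below computes system-level BLEU and chrF with the sacreBLEU Python API. The hypothesis and reference lists are placeholders; the officially released baseline files were produced with the organizers' own scripts.

```python
# Minimal sketch: system-level BLEU and chrF for one MT system using sacreBLEU.
# The hypothesis/reference lists are placeholders.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = [
    "The cat sat on the mat.",
    "It rained all day yesterday.",
]
references = [
    "The cat is sitting on the mat.",
    "Yesterday it rained the whole day.",
]

bleu = BLEU()
chrf = CHRF()

# corpus_score takes the system outputs and a list of reference streams.
print("BLEU:", bleu.corpus_score(hypotheses, [references]).score)
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)
```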
Output file format for system-level rankings
The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted as tab-separated values (TSV) in the following way:
METRIC-NAME\tLANG-PAIR\tTESTSET\tDOMAIN\tREFERENCE\tSYSTEM-ID\tSYSTEM-SCORE
The output files for segment-level scores should be called YOURMETRIC.seg.score.gz and formatted as tab-separated values (TSV) in the following way:
METRIC-NAME\tLANG-PAIR\tTESTSET\tDOMAIN\tDOCUMENT\tREFERENCE\tSYSTEM-ID\tSEGMENT-NUMBER\tSEGMENT-SCORE
Each field should be delimited by a single tab character.
Where: the TESTSET, DOMAIN and REFERENCE values for each segment can be found in metrics_inputs/txt/generaltest2022/metadata.
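To make the expected layout concrete, here is a minimal sketch that writes both score files in the required gzipped TSV format; the metric name, language pair, documents, and scores are placeholder values.

```python
# Minimal sketch: write system-level and segment-level score files in the
# required gzipped TSV formats. All values below are placeholders.
import gzip

METRIC = "MYMETRIC"  # placeholder metric name

# (lang-pair, testset, domain, reference, system-id) -> system-level score
sys_scores = {
    ("en-de", "generaltest2022", "news", "refA", "systemX"): 0.731,
}

# (lang-pair, testset, domain, document, reference, system-id, segment-number) -> score
seg_scores = {
    ("en-de", "generaltest2022", "news", "doc1", "refA", "systemX", 1): 0.854,
}

with gzip.open(f"{METRIC}.sys.score.gz", "wt", encoding="utf-8") as f:
    for (lp, testset, domain, ref, system), score in sys_scores.items():
        f.write("\t".join([METRIC, lp, testset, domain, ref, system, str(score)]) + "\n")

with gzip.open(f"{METRIC}.seg.score.gz", "wt", encoding="utf-8") as f:
    for (lp, testset, domain, doc, ref, system, seg), score in seg_scores.items():
        f.write("\t".join([METRIC, lp, testset, domain, doc, ref, system, str(seg), str(score)]) + "\n")
```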
Before you submit, please run your score files through the validation script, which is now available here. You can test it with either the BLEU or COMET-QE sys and seg score files in the baselines folder.
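If you want a quick sanity check before running the official validation script, the sketch below verifies the field counts of a score file. It only illustrates the kind of check such a script performs; it is not the official script, and the file names in the usage comment are hypothetical.

```python
# Minimal sanity check on a submission file: verify that every line of a
# gzipped TSV score file has the expected number of tab-separated fields.
import gzip
import sys

EXPECTED_FIELDS = {"sys": 7, "seg": 9}  # per the formats described above

def check(path, level):
    expected = EXPECTED_FIELDS[level]
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != expected:
                sys.exit(f"{path}:{lineno}: expected {expected} fields, got {len(fields)}")
    print(f"{path}: OK")

if __name__ == "__main__":
    # Usage (hypothetical file name): python check_scores.py MYMETRIC.sys.score.gz sys
    check(sys.argv[1], sys.argv[2])
```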
Please add yourself to this shared spreadsheet so we can keep track of your submissions.
Submissions should be sent to wmt22-metric@googlegroups.com with the subject "WMT Metrics submission".
You are allowed to submit multiple metrics, but we need you to indicate the primary metric in the email. If submitting more than one metric, please share a folder with all your metrics, for example on Google Drive or Dropbox.
Before August 30th (AOE), please send us an email with: