Release of training and dev data | April 10th, 2021 |
Release of test data | Early June, 2021 |
Test predictions deadline |
This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. It will cover estimation at sentence and word levels. The main new elements introduced this year are: (i) a zero-shot sentence-level prediction task to encourage language-independent and unsupervised approaches; and (ii) a task on predicting catastrophic, i.e. critical, translation errors: errors that make the translation convey a completely different meaning, which could lead to negative effects such as safety risks. In addition, we release new test sets for 2020's Tasks 1 and 2, and an extended version of the Wikipedia post-editing training data, going from 2 to 7 languages. Finally, for all tasks, participants will be asked to provide information on their model size (disk space without compression and number of parameters) with their submission, and will be able to rank systems based on this information.
In addition to generally advancing the state of the art in quality estimation, our specific goals are:
For all tasks, the datasets and NMT models that generated the translations are publicly available.
Participants are also allowed to explore any additional data and resources deemed relevant. Below are the three QE tasks addressing these goals.

This task offers the same training data as the WMT2020 Task 1: Wikipedia data for 6 language pairs, comprising high-resource English--German (En-De) and English--Chinese (En-Zh), medium-resource Romanian--English (Ro-En) and Estonian--English (Et-En), and low-resource Sinhala--English (Si-En) and Nepali--English (Ne-En), as well as a dataset combining Wikipedia and Reddit articles for Russian--English (Ru-En). The datasets were collected by translating sentences sampled from source-language articles using state-of-the-art Transformer NMT models, and were annotated with a variant of Direct Assessment (DA) scores by professional translators. Each sentence was annotated following the FLORES setup, a form of DA in which at least three professional translators rate each sentence from 0 to 100 according to the perceived translation quality. DA scores are standardised using the z-score by rater. Participating systems are required to score sentences according to z-standardised DA scores.
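To make the scoring target concrete, here is a minimal sketch of per-rater z-standardisation; the data layout (a list of (rater, segment, score) tuples) and the function name are assumptions for illustration, not the official annotation pipeline.

```python
import statistics

# Hedged sketch: z-standardise raw 0-100 DA scores per rater, then average
# the standardised scores per segment to obtain a z_mean-style target.
def z_standardise_by_rater(annotations):
    """annotations: iterable of (rater_id, segment_id, raw_score) tuples."""
    by_rater = {}
    for rater, _, score in annotations:
        by_rater.setdefault(rater, []).append(score)
    # Per-rater mean and standard deviation (std of a single score is 0).
    stats = {r: (statistics.mean(s), statistics.stdev(s) if len(s) > 1 else 0.0)
             for r, s in by_rater.items()}

    z_by_segment = {}
    for rater, seg, score in annotations:
        mean, std = stats[rater]
        z = (score - mean) / std if std > 0 else 0.0
        z_by_segment.setdefault(seg, []).append(z)
    # Mean of the standardised scores received by each segment.
    return {seg: statistics.mean(zs) for seg, zs in z_by_segment.items()}
```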
New: We provide new blind test sets of 1K sentence pairs for all languages, as well as test sets for 4 new language pairs for which no training data will be given.

Training, dev and test data: Download the training, development, test20 and test21 data, consisting of the following Wikipedia/Reddit datasets, all with 7K sentences for training, 1K sentences for development and 1K sentences for the 2020 test set, including information from the NMT model used to generate the translations (the model score for the sentence and log probabilities for words), as well as the title of the Wikipedia article the source sentence came from:
Test data: We provide 1K new test sentence pairs for each of the language pairs above, as well as information from the NMT model used to generate the translations: the model score for the sentence and log probabilities for words. In addition, we will provide test data for 4 new language pairs for zero-shot prediction.
Download the test21 data.

Baseline: The baseline system is a neural predictor-estimator approach implemented in OpenKiwi, similar to the one used here. For the predictor/feature-generation part, the baseline model uses a multilingual pre-trained encoder, namely XLM-RoBERTa (the xlm-roberta-base model from Hugging Face). The baseline model is fine-tuned on DA scores for Task 1.
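For orientation, here is a minimal sketch of a sentence-level regressor built on xlm-roberta-base with the Hugging Face transformers library. It is not the official OpenKiwi baseline; the class name, pooling choice and input order are assumptions.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Hedged sketch: encode the (source, MT) pair with XLM-R and regress a single
# z-standardised DA score from the first token's representation.
class SentenceQERegressor(nn.Module):
    def __init__(self, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # first-token ("CLS"-style) pooling
        return self.head(pooled).squeeze(-1)   # predicted DA z-score

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = SentenceQERegressor()
batch = tokenizer(["Das ist ein Test."], ["This is a test."],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    score = model(batch["input_ids"], batch["attention_mask"])
```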
Evaluation: Sentence-level submissions will be evaluated in terms of Pearson's correlation between the predicted DA and human DA (the z-standardised mean DA score, i.e. z_mean). These are the official evaluation scripts. The evaluation will focus on multilingual systems, i.e. systems that are able to provide predictions for all languages, including the zero-shot ones. Therefore, the average Pearson correlation across all these languages will be used to rank QE systems. We will also evaluate QE systems on a per-language basis for those interested in particular languages, and in the zero-shot scenario.
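The ranking criterion can be sketched as follows (not the official scripts; `predictions` and `gold` are assumed to map each language pair to parallel lists of scores):

```python
from scipy.stats import pearsonr

# Hedged sketch: Pearson correlation per language pair, plus the average used
# for the multilingual ranking.
def multilingual_pearson(predictions, gold):
    per_lp = {lp: pearsonr(predictions[lp], gold[lp])[0] for lp in gold}
    return per_lp, sum(per_lp.values()) / len(per_lp)

per_lp, average = multilingual_pearson(
    {"en-de": [0.1, -0.5, 0.7], "ro-en": [0.9, 0.2, -0.1]},
    {"en-de": [0.2, -0.4, 0.5], "ro-en": [1.0, 0.1, -0.3]},
)
```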
This task evaluates the application of QE for post-editing purposes. It consists of predicting:
Training, dev and test data: The data this year is the same as that used in Task 1, but with labels derived from post-editing. Download the training, development, test20 data. Word-level labels have been obtained by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation. HTER values are obtained deterministically from word-level tags. However, when computing HTER, we allow shifts in TER. Please note that we replaced the 2020 training, dev and test sets as there were some issues with the annotation. Make sure to download the new version from the repository.
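As a rough illustration of how the word-level tags relate to the sentence-level score, here is a hedged sketch; the edit-counting convention is an assumption, and the official HTER values come from the TER alignment described above, not from this approximation.

```python
# Hedged sketch: approximate an HTER-style score from word-level tags, assuming
# BAD MT words mark substitutions/insertions and BAD gaps mark omissions.
def approximate_hter(mt_tags, gap_tags, pe_length):
    """mt_tags: OK/BAD per MT token; gap_tags: OK/BAD per gap (len(mt_tags)+1);
    pe_length: number of tokens in the post-edited sentence."""
    edits = mt_tags.count("BAD") + gap_tags.count("BAD")
    return min(1.0, edits / max(pe_length, 1))

# Example: one bad word and one bad gap against a 6-token post-edit.
print(approximate_hter(["OK", "BAD", "OK", "OK", "OK"],
                       ["OK", "OK", "BAD", "OK", "OK", "OK"],
                       6))
```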
Test data: We provide 1K new test sentence pairs for each of the language pairs above, and for the 4 new language pairs for zero-shot prediction.
Baseline: The baseline system is the same as for Task 1, except that here it is fine-tuned on HTER and word-level tags (jointly).
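A joint head on top of the same encoder could look like the following sketch (not the official OpenKiwi model; class and attribute names are assumptions, and mapping subword predictions back to word-level OK/BAD tags is omitted):

```python
import torch
from torch import nn
from transformers import AutoModel

# Hedged sketch: one regression head for sentence-level HTER and one
# token-classification head for OK/BAD tags, trained jointly.
class JointQEModel(nn.Module):
    def __init__(self, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.hter_head = nn.Linear(hidden, 1)   # sentence-level HTER
        self.tag_head = nn.Linear(hidden, 2)    # per-token OK/BAD logits

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        hter = self.hter_head(states[:, 0]).squeeze(-1)
        tag_logits = self.tag_head(states)      # one prediction per subword
        return hter, tag_logits
```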
Evaluation: For sentence-level QE, submissions are evaluated in terms of Pearson's correlation for the sentence-level HTER prediction. For word-level QE, they will be evaluated in terms of the Matthews correlation coefficient (MCC).
These are the official evaluation scripts.
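Both metrics are available off the shelf; the snippet below is a hedged illustration with toy values, not the official evaluation scripts.

```python
from scipy.stats import pearsonr
from sklearn.metrics import matthews_corrcoef

# Hedged sketch: Pearson for sentence-level HTER, MCC for word-level OK/BAD.
hter_pred, hter_gold = [0.12, 0.40, 0.05], [0.10, 0.35, 0.00]
tags_pred = ["OK", "BAD", "OK", "OK", "BAD"]
tags_gold = ["OK", "BAD", "BAD", "OK", "BAD"]

print("Pearson r:", pearsonr(hter_pred, hter_gold)[0])
print("MCC:", matthews_corrcoef(tags_gold, tags_pred))
```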
The goal of this task is to predict sentence-level binary scores indicating whether or not a translation contains (at least one) critical error. Translations with such errors are defined as translations that deviate in meaning as compared to the source sentence in such a way that they are misleading and may carry health, safety, legal, reputation, religious or financial implications. Meaning deviations from the source sentence can happen in three ways:
Training, development and test data: The data consists of Wikipedia comments in English extracted from two sources: the Jigsaw Toxic Comment Classification Challenge and the Wikipedia Comments Corpus, with translations generated by the ML50 multilingual translation model by FAIR. It contains instances in the following languages:
Test data: Approximately 1K sentence pairs for each language pair are provided.
Baseline: The baseline system is a MonoTransQuest model similar to the one used here.
Evaluation: Submissions will be evaluated in terms of standard classification metrics, with MCC as the main metric.
These are the official evaluation scripts.
For CODALAB submissions, click:
Submission Information
The output of your system for the sentence-level subtask should be a single file with the first two lines indicating model size, and the rest containing predicted scores, one per line for each sentence, formatted as:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Lines 3-n, where n is the number of test samples:
<LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>

Where:
LANGUAGE PAIR is the ID (e.g. en-de) of the language pair of the plain text translation file you are scoring.
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0 for Tasks 1 & 2, matching the sentence pair identifier from the blind test file for Task 3).
SEGMENT SCORE is the predicted (DA/HTER/binary) score for the particular segment.

Each field should be delimited by a single tab character.
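A simple writer for this format could look like the sketch below (file and argument names are illustrative, not prescribed). For a PyTorch model, the number of parameters can be obtained with sum(p.numel() for p in model.parameters()).

```python
import os

# Hedged sketch: write a sentence-level submission file, with the uncompressed
# model size and parameter count on the first two lines, followed by
# tab-separated segment scores (segment numbers start at 0).
def write_sentence_submission(path, model_file, n_params, lp, method, scores):
    disk_bytes = os.path.getsize(model_file)   # size of a single model file
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{disk_bytes}\n")
        f.write(f"{n_params}\n")
        for i, score in enumerate(scores):
            f.write(f"{lp}\t{method}\t{i}\t{score:.6f}\n")
```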
We request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these, or all of them, independently. The output of your system for each type of label should be labels at the word level, formatted in the following way:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Lines 3-n, where n is the number of test samples:
<LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
Where:
LANGUAGE PAIR is the ID (e.g., en-de) of the language pair.
METHOD NAME is the name of your quality estimation method.
TYPE is the type of label predicted: mt, gap or source.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD is the actual word. For the 'gap' submission, use the dummy symbol 'gap'.
BINARY SCORE is either 'OK' for no issue or 'BAD' for any issue.

Each field should be delimited by a single tab character.
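For completeness, here is a hedged sketch of a writer for one of the word-level files; the function and argument names are illustrative, not prescribed.

```python
# Hedged sketch: write a word-level submission file (e.g. the 'mt' tags file).
# `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs;
# for a 'gap' file, each word would be the dummy symbol 'gap'.
def write_word_submission(path, disk_bytes, n_params, lp, method, label_type,
                          tagged_sentences):
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{disk_bytes}\n")
        f.write(f"{n_params}\n")
        for seg_id, sentence in enumerate(tagged_sentences):
            for word_idx, (word, tag) in enumerate(sentence):
                f.write(f"{lp}\t{method}\t{label_type}\t{seg_id}\t"
                        f"{word_idx}\t{word}\t{tag}\n")
```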
Here is some open-source software for QE that might be useful for participants:
Please check that your system output on the dev data is correctly read by the official evaluation scripts.
For questions or comments on Tasks 1 and 3, email lspecia@gmail.com.
For questions or comments on Task 2, email erickrfonseca@gmail.com.
For questions or comments on Codalab, please use the forum available for each task.