Release of training and dev data | April 10th, 2021 |
Release of test data | Early June, 2021 |
Test predictions deadline |
This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. It will cover estimation at sentence and word levels. The main new elements introduced this year are: (i) a zero-shot sentence-level prediction task to encourage language-independent and unsupervised approaches; and (ii) a task on predicting catastrophic, i.e. critical, translation errors: errors that make the translation convey a completely different meaning, which could lead to negative effects such as safety risks. In addition, we release new test sets for 2020's Tasks 1 and 2, and an extended version of the Wikipedia post-editing training data, going from 2 to 7 languages. Finally, for all tasks, participants will be asked to provide information on their model size (disk space without compression and number of parameters) with their submission, and will be able to rank systems based on this information.
In addition to generally advancing the state of the art in quality estimation, our specific goals are:
For all tasks, the datasets and NMT models that generated the translations are publicly available.
Participants are also allowed to explore any additional data and resources deemed relevant. Below are the three QE tasks addressing these goals.

This task offers the same training data as the WMT2020 Task 1: Wikipedia data for 6 language pairs, comprising high-resource English--German (En-De) and English--Chinese (En-Zh), medium-resource Romanian--English (Ro-En) and Estonian--English (Et-En), and low-resource Sinhala--English (Si-En) and Nepali--English (Ne-En), as well as a dataset combining Wikipedia and Reddit articles for Russian--English (Ru-En). The datasets were collected by translating sentences sampled from source-language articles using state-of-the-art Transformer NMT models, and were annotated with a variant of Direct Assessment (DA) scores by professional translators. Each sentence was annotated following the FLORES setup, a form of DA in which at least three professional translators rate each sentence from 0 to 100 according to the perceived translation quality. DA scores are standardised using the z-score by rater. Participating systems are required to score sentences according to z-standardised DA scores.
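To make the scoring target concrete, here is a minimal sketch of per-rater z-standardisation; the data layout (a list of (rater, segment, score) tuples) and the function name are assumptions for illustration, not the official annotation pipeline.

```python
import statistics

# Hedged sketch: z-standardise raw 0-100 DA scores per rater, then average
# the standardised scores per segment to obtain a z_mean-style target.
def z_standardise_by_rater(annotations):
    """annotations: iterable of (rater_id, segment_id, raw_score) tuples."""
    by_rater = {}
    for rater, _, score in annotations:
        by_rater.setdefault(rater, []).append(score)
    # Per-rater mean and standard deviation (std of a single score is 0).
    stats = {r: (statistics.mean(s), statistics.stdev(s) if len(s) > 1 else 0.0)
             for r, s in by_rater.items()}

    z_by_segment = {}
    for rater, seg, score in annotations:
        mean, std = stats[rater]
        z = (score - mean) / std if std > 0 else 0.0
        z_by_segment.setdefault(seg, []).append(z)
    # Mean of the standardised scores received by each segment.
    return {seg: statistics.mean(zs) for seg, zs in z_by_segment.items()}
```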
New: We provide new blind test sets of 1K sentence pairs for all languages, as well as test sets for 4 new language pairs for which no training data will be given.

Training, dev and test data: Download the training, development, test20 and test21 data, consisting of the following Wikipedia/Reddit datasets, all with 7K sentences for training, 1K sentences for development and 1K sentences for the 2020 test set, including information from the NMT model used to generate the translations (the model score for the sentence and log probabilities for words), as well as the title of the Wikipedia article the source sentence came from:
Test data: We provide 1K new test sentence pairs for each of the language pairs above, as well as information from the NMT model used to generate the translations: the model score for the sentence and log probabilities for words. In addition, we will provide test data for 4 new language pairs for zero-shot prediction.
Download the test21 data.

Baseline: The baseline system is a neural predictor-estimator approach implemented in OpenKiwi, similar to the one used here. For the predictor/feature-generation part, the baseline model uses a multilingual pre-trained encoder, namely XLM-RoBERTa (the xlm-roberta-base model from Hugging Face). The baseline model is fine-tuned on DA scores for Task 1.
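For orientation, here is a minimal sketch of a sentence-level regressor built on xlm-roberta-base with the Hugging Face transformers library. It is not the official OpenKiwi baseline; the class name, pooling choice and input order are assumptions.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Hedged sketch: encode the (source, MT) pair with XLM-R and regress a single
# z-standardised DA score from the first token's representation.
class SentenceQERegressor(nn.Module):
    def __init__(self, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # first-token ("CLS"-style) pooling
        return self.head(pooled).squeeze(-1)   # predicted DA z-score

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = SentenceQERegressor()
batch = tokenizer(["Das ist ein Test."], ["This is a test."],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    score = model(batch["input_ids"], batch["attention_mask"])
```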
Evaluation: Sentence-level submissions will be evaluated in terms of Pearson's correlation between the predicted DA and human DA (the z-standardised mean DA score, i.e. z_mean). These are the official evaluation scripts. The evaluation will focus on multilingual systems, i.e. systems that are able to provide predictions for all languages, including the zero-shot ones. Therefore, the average Pearson correlation across all these languages will be used to rank QE systems. We will also evaluate QE systems on a per-language basis for those interested in particular languages, and in the zero-shot scenario.
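The ranking criterion can be sketched as follows (not the official scripts; `predictions` and `gold` are assumed to map each language pair to parallel lists of scores):

```python
from scipy.stats import pearsonr

# Hedged sketch: Pearson correlation per language pair, plus the average used
# for the multilingual ranking.
def multilingual_pearson(predictions, gold):
    per_lp = {lp: pearsonr(predictions[lp], gold[lp])[0] for lp in gold}
    return per_lp, sum(per_lp.values()) / len(per_lp)

per_lp, average = multilingual_pearson(
    {"en-de": [0.1, -0.5, 0.7], "ro-en": [0.9, 0.2, -0.1]},
    {"en-de": [0.2, -0.4, 0.5], "ro-en": [1.0, 0.1, -0.3]},
)
```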
This task evaluates the application of QE for post-editing purposes. It consists of predicting:
Training, dev and test data: The data this year is the same as that used in Task 1, but with labels derived from post-editing. Download the training, development, test20 data. Word-level labels have been obtained by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation. HTER values are obtained deterministically from word-level tags. However, when computing HTER, we allow shifts in TER. Please note that we replaced the 2020 training, dev and test sets as there were some issues with the annotation. Make sure to download the new version from the repository.
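As a rough illustration of how the word-level tags relate to the sentence-level score, here is a hedged sketch; the edit-counting convention is an assumption, and the official HTER values come from the TER alignment described above, not from this approximation.

```python
# Hedged sketch: approximate an HTER-style score from word-level tags, assuming
# BAD MT words mark substitutions/insertions and BAD gaps mark omissions.
def approximate_hter(mt_tags, gap_tags, pe_length):
    """mt_tags: OK/BAD per MT token; gap_tags: OK/BAD per gap (len(mt_tags)+1);
    pe_length: number of tokens in the post-edited sentence."""
    edits = mt_tags.count("BAD") + gap_tags.count("BAD")
    return min(1.0, edits / max(pe_length, 1))

# Example: one bad word and one bad gap against a 6-token post-edit.
print(approximate_hter(["OK", "BAD", "OK", "OK", "OK"],
                       ["OK", "OK", "BAD", "OK", "OK", "OK"],
                       6))
```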
Test data: We provide 1K new test sentence pairs for each of the language pairs above, and for the 4 new language pairs for zero-shot prediction.
Baseline: The baseline system is the same as for Task 1, except that here it is fine-tuned on HTER and word-level tags (jointly).
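A joint head on top of the same encoder could look like the following sketch (not the official OpenKiwi model; class and attribute names are assumptions, and mapping subword predictions back to word-level OK/BAD tags is omitted):

```python
import torch
from torch import nn
from transformers import AutoModel

# Hedged sketch: one regression head for sentence-level HTER and one
# token-classification head for OK/BAD tags, trained jointly.
class JointQEModel(nn.Module):
    def __init__(self, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.hter_head = nn.Linear(hidden, 1)   # sentence-level HTER
        self.tag_head = nn.Linear(hidden, 2)    # per-token OK/BAD logits

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        hter = self.hter_head(states[:, 0]).squeeze(-1)
        tag_logits = self.tag_head(states)      # one prediction per subword
        return hter, tag_logits
```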
Evaluation: For sentence-level QE, submissions are evaluated in terms of Pearson's correlation for the sentence-level HTER prediction. For word-level QE, they will be evaluated in terms of the Matthews correlation coefficient (MCC).
These are the official evaluation scripts.
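Both metrics are available off the shelf; the snippet below is a hedged illustration with toy values, not the official evaluation scripts.

```python
from scipy.stats import pearsonr
from sklearn.metrics import matthews_corrcoef

# Hedged sketch: Pearson for sentence-level HTER, MCC for word-level OK/BAD.
hter_pred, hter_gold = [0.12, 0.40, 0.05], [0.10, 0.35, 0.00]
tags_pred = ["OK", "BAD", "OK", "OK", "BAD"]
tags_gold = ["OK", "BAD", "BAD", "OK", "BAD"]

print("Pearson r:", pearsonr(hter_pred, hter_gold)[0])
print("MCC:", matthews_corrcoef(tags_gold, tags_pred))
```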
The goal of this task is to predict sentence-level binary scores indicating whether or not a translation contains (at least one) critical error. Translations with such errors are defined as translations that deviate in meaning as compared to the source sentence in such a way that they are misleading and may carry health, safety, legal, reputation, religious or financial implications. Meaning deviations from the source sentence can happen in three ways:
Training, development and test data: The data consists of Wikipedia comments in English extracted from two sources: the Jigsaw Toxic Comment Classification Challenge and the Wikipedia Comments Corpus, with translations generated by the ML50 multilingual translation model by FAIR. It contains instances in the following languages:
Test data: Approximately 1K sentence pairs for each language pair are provided.
Baseline: The baseline system is a MonoTransQuest model similar to the one used here.
Evaluation: Submissions will be evaluated in terms of standard classification metrics, with MCC as the main metric.
These are the official evaluation scripts.
For CODALAB submissions, click:
Submission Information
The output of your system for the sentence-level subtask should be a single file with the first two lines indicating model size, and the rest containing predicted scores, one per line for each sentence, formatted as:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Lines 3-n, where n is the number of test samples:
<LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>

Where:
LANGUAGE PAIR is the ID (e.g. en-de) of the language pair of the plain text translation file you are scoring.
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0 for Tasks 1 & 2, matching the sentence pair identifier from the blind test file for Task 3).
SEGMENT SCORE is the predicted (DA/HTER/binary) score for the particular segment.

Each field should be delimited by a single tab character.
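A simple writer for this format could look like the sketch below (file and argument names are illustrative, not prescribed). For a PyTorch model, the number of parameters can be obtained with sum(p.numel() for p in model.parameters()).

```python
import os

# Hedged sketch: write a sentence-level submission file, with the uncompressed
# model size and parameter count on the first two lines, followed by
# tab-separated segment scores (segment numbers start at 0).
def write_sentence_submission(path, model_file, n_params, lp, method, scores):
    disk_bytes = os.path.getsize(model_file)   # size of a single model file
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{disk_bytes}\n")
        f.write(f"{n_params}\n")
        for i, score in enumerate(scores):
            f.write(f"{lp}\t{method}\t{i}\t{score:.6f}\n")
```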
We request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these, or all of them, independently. The output of your system for each type of label should be labels at the word level, formatted in the following way:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Lines 3-n, where n is the number of test samples:
<LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
Where:
LANGUAGE PAIR is the ID (e.g., en-de) of the language pair.
METHOD NAME is the name of your quality estimation method.
TYPE is the type of label predicted: mt, gap or source.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD is the actual word. For the 'gap' submission, use the dummy symbol 'gap'.
BINARY SCORE is either 'OK' for no issue or 'BAD' for any issue.

Each field should be delimited by a single tab character.
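For completeness, here is a hedged sketch of a writer for one of the word-level files; the function and argument names are illustrative, not prescribed.

```python
# Hedged sketch: write a word-level submission file (e.g. the 'mt' tags file).
# `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs;
# for a 'gap' file, each word would be the dummy symbol 'gap'.
def write_word_submission(path, disk_bytes, n_params, lp, method, label_type,
                          tagged_sentences):
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{disk_bytes}\n")
        f.write(f"{n_params}\n")
        for seg_id, sentence in enumerate(tagged_sentences):
            for word_idx, (word, tag) in enumerate(sentence):
                f.write(f"{lp}\t{method}\t{label_type}\t{seg_id}\t"
                        f"{word_idx}\t{word}\t{tag}\n")
```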
Here is some open-source software for QE that might be useful for participants:
Please check that your system output on the dev data is correctly read by the official evaluation scripts.
For questions or comments on Tasks 1 and 3, email lspecia@gmail.com.
For questions or comments on Task 2, email erickrfonseca@gmail.com.
For questions or comments on Codalab, please use the forum available for each task.