Shared Task: Quality Estimation

**UPDATE** -- Official results and submissions are available.

This shared task will build on its previous six editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with data for all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that this year we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:

These goals are addressed in Tasks 1-4. For Tasks 1-3, in-house statistical and neural MT systems were built to produce the translations; these systems are described in this paper. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant. For Task 4, we used an online neural MT system to produce translations for a subset of this dataset, and both the data and annotations are available under a Creative Commons license.



Task 1: Sentence-level QE

Participating systems are required to score (and rank) sentences according to post-editing effort. Three labels are available: the percentage of edits needed to fix the translation (HTER), post-editing time in seconds, and counts of various types of keystrokes. The primary prediction label for the scoring variant will be HTER, but we welcome participants to submit alternative models trained to predict other labels. Predictions according to each alternative label will be evaluated independently. For the ranking variant, the predictions can be generated by models built using any of these labels (or their combination), as well as using external information. The data consists of:

Download training and development data for all languages.

For all language pairs and MT system types, we filtered the data from the originally collected set to remove most cases with no edits performed. Skewed distributions towards good-quality translations proved to be a problem in previous years, and the issue is aggravated with the in-domain NMT data, where for some language pairs about half of the sentences require no post-editing at all. We kept a small proportion of HTER=0 sentences in the training, development and test sets. The data for download contains source sentences, their machine translations, their post-editions (corrected translations), and HTER, post-editing time and keystrokes as post-editing effort scores. The full datasets before filtering can be made available on demand. The PET tool was used to collect these various types of information during post-editing. HTER labels were computed using TER (default settings: tokenised, case insensitive, exact matching only, with scores capped to 1).
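For illustration, the sketch below approximates the HTER computation: the number of word-level edit operations between the MT output and its post-edited version, normalised by the length of the post-edit and capped at 1. The official labels were produced with the TER tool, which additionally handles shift operations; this simplified version does not.

```python
# Hedged approximation of HTER: word-level Levenshtein edits (insertions,
# deletions, substitutions) between MT output and post-edit, divided by the
# post-edit length and capped at 1. Unlike TER, shifts are not modelled here.

def word_edit_distance(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def approximate_hter(mt_sentence, post_edit):
    hyp = mt_sentence.lower().split()   # tokenised, case-insensitive, as in the official setup
    ref = post_edit.lower().split()
    if not ref:
        return 0.0
    return min(1.0, word_edit_distance(hyp, ref) / len(ref))
```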

As test data, for each language pair we will provide 1,000+ new sentence translations, produced by the same MT system used for the training data for each language pair and MT system type.
NEW: Download the test data and the corresponding baseline features for English-German, English-Czech, English-Latvian and German-English. For English-German and German-English, we also ask you to submit your results on the 2017 test data so we can attempt to measure progress over the years.

The usual 17 features used in WMT12-WMT17 are used for the baseline system. This system uses SVM regression with an RBF kernel, as well as a grid search for the optimisation of the relevant parameters. QuEst++ is used to build the baseline prediction model.
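As a rough sketch of this recipe (the official baseline is built with QuEst++), the following uses scikit-learn's SVR with an RBF kernel and a grid search over its hyper-parameters; the file names and grid values are illustrative, not the official configuration.

```python
# Minimal sketch of the sentence-level baseline: SVM regression (RBF kernel)
# over the 17 baseline features, with hyper-parameters tuned by grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X_train = np.loadtxt("train.baseline_features")  # one row of 17 features per sentence (assumed file)
y_train = np.loadtxt("train.hter")               # HTER label per sentence (assumed file)
X_test = np.loadtxt("test.baseline_features")

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1], "epsilon": [0.05, 0.1, 0.2]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)
predicted_hter = grid.predict(X_test)
```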

As in previous years, two variants of the results can be submitted:

Evaluation is performed against the true label and/or ranking using the following metrics:
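For reference, a minimal evaluation sketch is given below, assuming Pearson's r for the scoring variant (the primary metric, also used for Task 4) and Spearman's rho for the ranking variant; the official evaluation scripts compute the full set of metrics.

```python
# Hedged evaluation sketch: Pearson's r for scoring, Spearman's rho (assumed)
# for ranking, both against the gold labels.
from scipy.stats import pearsonr, spearmanr

def evaluate_scoring(predicted, gold):
    return pearsonr(predicted, gold)[0]

def evaluate_ranking(predicted, gold):
    return spearmanr(predicted, gold)[0]
```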



Task 2: Word-level QE

As in previous years, we frame the problem as the binary task of distinguishing between 'OK' and 'BAD' tokens. Participating systems are required to detect errors for each token in the MT output. In addition, in contrast to previous years, we also attempt to predict missing words in the translation for the first time. We require participants to label any sequence of one or more missing tokens with a single 'BAD' label, and also to indicate the 'BAD' tokens in the source sentence that are related to the tokens missing from the translation. This is particularly important for spotting adequacy errors in NMT.

The data for this task is exactly the same as that provided in Task 1. It was annotated with binary word-level labels using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, shifts disabled with the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation. Missing tokens in the machine translations, as indicated by the TER tool, are annotated as follows: a gap tag is placed at the start of the sentence and after each token. This tag is set to 'BAD' if one or more tokens should appear in that position, and to 'OK' otherwise. Note that the number of tags for each target sentence is 2*N+1, where N is the number of tokens in the sentence. All tokens in the source sentences are also labelled with either 'OK' or 'BAD'. For this, the alignments between source and post-edited sentences are used: if a token is labelled 'BAD' in the translation, all source tokens aligned to it are labelled 'BAD'. This is meant to indicate which source tokens lead to errors in the translation.
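The sketch below illustrates how the 2*N+1 target-side tags are laid out, assuming the word and gap labels have already been derived from the TER alignment as described above.

```python
# Interleaving word tags and gap tags into the 2*N+1 target-side tag sequence:
# a gap tag before the first token and after every token, alternating with word tags.

def interleave_tags(word_tags, gap_tags):
    """word_tags: N labels for the MT tokens; gap_tags: N+1 labels for the gaps.
    Returns the combined sequence of 2*N+1 labels: gap, word, gap, word, ..., gap."""
    assert len(gap_tags) == len(word_tags) + 1
    combined = [gap_tags[0]]
    for word_tag, gap_tag in zip(word_tags, gap_tags[1:]):
        combined.extend([word_tag, gap_tag])
    return combined

# Example with N = 3 tokens: 7 tags in total.
print(interleave_tags(["OK", "BAD", "OK"], ["OK", "OK", "BAD", "OK"]))
# ['OK', 'OK', 'OK', 'BAD', 'BAD', 'OK', 'OK']
```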

As training and development data, we provide the tokenised and truecased source and translation outputs with source and target tokens annotated with 'OK' or 'BAD' labels, as well as the source-target alignments, and gaps annotated for the translations. Download training and development data for all languages. Download German-English, English-German, English-Czech and English-Latvian baseline features.

As test data, for each language pair we will provide 1,000+ new sentence translations, produced in the same way.
NEW: Download the test data and the corresponding baseline features for English-German, English-Czech, English-Latvian and German-English. For English-German and German-English, we also ask you to submit your results on the 2017 test data so we can attempt to measure progress over the years.

The baseline system is similar to the baseline used at WMT15-WMT17: the set of baseline features includes the same features as the ones used last year, with the addition of feature combinations (target word + left/right context, target word + source word, etc.). The features are extracted with the Marmot QE tool. The system is trained with the CRFSuite toolkit using the passive-aggressive algorithm.
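As an illustration of this recipe (the official features are extracted with Marmot), the sketch below trains a CRF with python-crfsuite's passive-aggressive algorithm over toy lexical feature combinations; the feature set and training data here are placeholders, not the official baseline.

```python
# Hedged sketch of the word-level baseline idea: a CRF trained with the
# passive-aggressive algorithm over simple feature combinations
# (target word + left/right context, target word + aligned source word).
import pycrfsuite

def token_features(mt_tokens, src_aligned, i):
    return {
        "word": mt_tokens[i],
        "left": mt_tokens[i - 1] if i > 0 else "<s>",
        "right": mt_tokens[i + 1] if i < len(mt_tokens) - 1 else "</s>",
        "word+src": mt_tokens[i] + "|" + src_aligned[i],  # target word + source word combination
    }

# Toy training data: (MT tokens, aligned source tokens, gold tags) triples.
training_sentences = [
    (["das", "Haus", "ist", "klein"], ["the", "house", "is", "small"], ["OK", "OK", "OK", "BAD"]),
    (["ein", "Auto"], ["a", "car"], ["OK", "OK"]),
]

trainer = pycrfsuite.Trainer(algorithm="pa")  # passive-aggressive training in CRFSuite
for mt_tokens, src_aligned, labels in training_sentences:
    feats = [token_features(mt_tokens, src_aligned, i) for i in range(len(mt_tokens))]
    trainer.append(feats, labels)
trainer.train("wordlevel_baseline.crfsuite")
```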

Submissions are evaluated in terms of classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently:

We will also provide an overall F1 score that combines the three labels for systems submitting them all. We use this evaluation script for the metrics, and this script to compute significance levels using approximate randomisation.
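The sketch below shows the primary word-level metric, assuming standard per-class F1 scores: the product of the F1 of the 'OK' class and the F1 of the 'BAD' class. The official numbers are produced by the evaluation script linked above.

```python
# F1-multi: product of the per-class F1 scores for 'OK' and 'BAD'.
from sklearn.metrics import f1_score

def f1_multi(gold_tags, predicted_tags):
    f1_ok = f1_score(gold_tags, predicted_tags, pos_label="OK")
    f1_bad = f1_score(gold_tags, predicted_tags, pos_label="BAD")
    return f1_ok * f1_bad
```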

Task 3: Word/phrase-level QE with human annotation for phrases

This task uses a subset of the German-English SMT data from Task 1 where each phrase (as produced by the decoder) has been annotated (as a phrase) by humans with one of four labels: 'OK', 'BAD' (the phrase contains one or more errors), 'BAD_word_order' (the phrase is in an incorrect position in the sentence), and 'BAD_omission' (a word is missing before/after the phrase). We divided this task into two subtasks: word-level prediction (Task 3a) and phrase-level prediction (Task 3b). Download the German-English training data (5,921 instances) and development data (1,000 instances) for both variants, as well as the baseline features.

As test data, we will provide 543 new sentence translations, produced and annotated in the same way.
NEW: Download the test data and the corresponding baseline features.



Task 4: Document-level QE

This is a completely new task. It is based on data from the Amazon Product Reviews dataset: a selection of Sports and Outdoors product titles and descriptions in English that have been machine translated into French using a state-of-the-art online neural MT system. The most popular products (those with the most reviews) were chosen. This data poses interesting challenges for machine translation: titles and descriptions are often short and not always complete sentences. The data was annotated for errors at the word level using a fine-grained error taxonomy (MQM).

MQM is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (the translation affects the reading of the text) and style (the translation has stylistic problems, like the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides the identification of an error and its classification according to this typology (by applying a specific tag), each error receives a severity level that indicates its impact on the overall meaning, style and fluency of the translation. An error can be minor (if it does not lead to a loss of meaning and does not confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).

The word error annotations and their severity levels can be extrapolated to phrases, sentences and documents. For this task, we concentrate on the latter, where a document contains the product title and description for a given product. The document-level scores were generated from the word-level errors and their severities using the method in this paper (footnote 6). The dataset is the largest collection with manually annotated word-level errors released to date.

The training and development data contain 1,000 English-French training documents and 200 development documents, comprising a total of 7,304 segments with word-level error annotations. Download training and development sets.

The baseline system will be the same as that of the document-level task at WMT16, using QuEst++, except for the GIZA++-related features. Download a subset of 15 features for all test sets.

NEW: The test data contains 269 English-French documents with 1,652 segments. Download test set.

Submissions will be evaluated as in Task 1, in terms of Pearson's correlation between the true and predicted document-level scores.



Additional resources

These are the resources we used to extract the baseline features for Task 1, which can also be useful for Tasks 2 and 3:

English-German

German-English

English-Latvian

English-Czech



Submission Information

For CODALAB submissions, click:

Submission Format

Task 1

The output of your system for a given subtask should contain scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Each field should be delimited by a single tab character.
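For illustration, a minimal writer for this format is sketched below; the method name, output file name and the ranking convention (rank 1 for the segment predicted to need the least post-editing) are assumptions, not official requirements.

```python
# Hypothetical writer for the Task 1 segment-level format:
# <METHOD NAME> tab <SEGMENT NUMBER> tab <SEGMENT SCORE> tab <SEGMENT RANK>
def write_task1_submission(hter_scores, method="MY_METHOD", path="predictions.task1"):
    # Assumed convention: rank 1 for the lowest predicted HTER.
    order = sorted(range(len(hter_scores)), key=lambda i: hter_scores[i])
    rank_of = {seg: rank + 1 for rank, seg in enumerate(order)}
    with open(path, "w") as out:
        for seg, score in enumerate(hter_scores):
            out.write("\t".join([method, str(seg), f"{score:.4f}", str(rank_of[seg])]) + "\n")
```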

Tasks 2 and 3a

Since this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these label types, or all of them, independently. The output of your system for each type of label should contain labels at the word level, formatted in the following way:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE> 

Each field should be delimited by a single tab character.
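A minimal writer for this format is sketched below; the method name and the type identifier are placeholders, and zero-based segment/word indices are an assumption of the sketch.

```python
# Hypothetical writer for the word-level format:
# <METHOD NAME> tab <TYPE> tab <SEGMENT NUMBER> tab <WORD INDEX> tab <WORD> tab <BINARY SCORE>
def write_wordlevel_file(sentences, path, method="MY_METHOD", label_type="mt"):
    """sentences: list of (tokens, tags) pairs, one per segment, in test-set order."""
    with open(path, "w") as out:
        for seg_id, (tokens, tags) in enumerate(sentences):
            for idx, (token, tag) in enumerate(zip(tokens, tags)):
                out.write("\t".join([method, label_type, str(seg_id), str(idx), token, tag]) + "\n")
```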

Task 3b

The output of your system should contain predictions for the translations at the phrase level. Use up to three separate files, one for each type of label: MT phrases, MT gaps and source phrases, formatted in the following way:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <PHRASE INDEX> <PHRASE> <BINARY SCORE> 

Each field should be delimited by a single tab character.

Example of the phrase-level format:

PHRASE_BASELINE 4 0 Geben Sie im Eigenschafteninspektor ( BAD
PHRASE_BASELINE 4 1 " Fenster " > " Eigenschaften " OK
PHRASE_BASELINE 4 2 ) , und wählen Sie BAD
PHRASE_BASELINE 4 3 Statischer Text OK
PHRASE_BASELINE 4 4 oder OK
PHRASE_BASELINE 4 5 Dynamischer Text OK
PHRASE_BASELINE 4 6 . OK

The example shows the labelling for the sentence (double vertical lines show phrase borders):

Geben Sie im Eigenschafteninspektor ( || ' Fenster ' > ' Eigenschaften ' || ) , und wählen Sie || Statischer Text || oder || Dynamischer Text || .

performed by the PHRASE_BASELINE system.

Task 4

The output of your system should contain scores for the translations at the document level, formatted in the following way:

<METHOD NAME> <DOCUMENT NUMBER> <DOCUMENT SCORE> 

The predictions should be sorted by ascending DOCUMENT NUMBER, and each field should be delimited by a single tab character.

Example of the document-level format:

DOC_BASELINE 0 00.000
DOC_BASELINE 1 11.111
DOC_BASELINE 2 22.222

The example shows that the documents named "doc0000", "doc0001" and "doc0002" have predicted quality scores of 00.000, 11.111 and 22.222, respectively.

Submission Requirements

Each participating team can submit at most 2 systems for each language pair of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs). These should be sent via email to Lucia Specia (lspecia@gmail.com). Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1, 2, 3.

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM

For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.

You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to give an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data: February 15, 2018
Release of test data: May 15, 2018
QE metrics results submission deadline: NEW: June 22, 2018
Paper submission deadline: July 27, 2018
Notification of acceptance: August 18, 2018
Camera-ready deadline: August 31, 2018

Organisers


Frédéric Blain (University of Sheffield)
Ramon Fernandez (Unbabel)
Varvara Logacheva (Moscow Institute of Physics and Technology)
Andre Martins (Unbabel)
Lucia Specia (University of Sheffield)

Contact

For questions or comments, email lspecia@gmail.com.

Supported by the European Commission under the projects

** OFFICIAL RESULTS **

Results of Task 2, Task 3a/b, Task 4.


Task 1 -- Sentence-level

Scoring and ranking results are available for English-German (SMT), English-German (NMT), German-English, English-Latvian (SMT), English-Latvian (NMT) and English-Czech.



Task 2 -- Word-level

Results for words in MT, gaps in MT and words in the source are available for English-German (SMT), English-German (NMT), German-English, English-Latvian (SMT), English-Latvian (NMT) and English-Czech.



Task 3 -- Phrase-level

Results are available for Task 3a (word-level) and Task 3b (phrase-level), each covering predictions in MT, gaps in MT and predictions in the source.



Task 4 -- Document-level

Results are available for English-French.