Shared Task: Quality Estimation

**UPDATE** -- Official results and submissions are available.

This shared task will build on its previous six editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with data for all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that this year we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:

These goals are addressed in Tasks 1-4. For Tasks 1-3, in-house statistical and neural MT systems were built to produce the translations; these systems are described in this paper. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant. For Task 4, we used an online neural MT system to produce translations for a subset of this dataset, and both the data and annotations are available under a Creative Commons license.



Task 1: Sentence-level QE

Participating systems are required to score (and rank) sentences according to post-editing effort. Three labels are available: the percentage of edits needed to fix the translation (HTER), post-editing time in seconds, and counts of various types of keystrokes. The primary prediction label for the scoring variant will be HTER, but we welcome participants to submit alternative models trained to predict other labels. Predictions according to each alternative label will be evaluated independently. For the ranking variant, the predictions can be generated by models built using any of these labels (or their combination), as well as using external information. The data consists of:

Download training and development data for all languages.

For all language pairs and MT system types, we filtered the data from the originally collected set to remove most cases with no edits performed. Skewed distributions towards good-quality translations proved to be a problem in previous years, and the issue is aggravated with the in-domain NMT data, where for some language pairs about half of the sentences require no post-editing at all. We kept a small proportion of HTER=0 sentences in the training, development and test sets. The data for download contains source sentences, their machine translations, their post-editions (corrected translations), and HTER, post-editing time and keystrokes as post-editing effort scores. The full datasets before filtering can be made available on demand. The PET tool was used to collect these various types of information during post-editing. HTER labels were computed using TER (default settings: tokenised, case insensitive, exact matching only, with scores capped to 1).
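For illustration, the sketch below approximates the HTER computation: the number of word-level edit operations between the MT output and its post-edited version, normalised by the length of the post-edit and capped at 1. The official labels were produced with the TER tool, which additionally handles shift operations; this simplified version does not.

```python
# Hedged approximation of HTER: word-level Levenshtein edits (insertions,
# deletions, substitutions) between MT output and post-edit, divided by the
# post-edit length and capped at 1. Unlike TER, shifts are not modelled here.

def word_edit_distance(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def approximate_hter(mt_sentence, post_edit):
    hyp = mt_sentence.lower().split()   # tokenised, case-insensitive, as in the official setup
    ref = post_edit.lower().split()
    if not ref:
        return 0.0
    return min(1.0, word_edit_distance(hyp, ref) / len(ref))
```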

As test data, for each language pair we will provide 1,000+ new sentence translations, produced by the same MT system used for the training data for each language pair and MT system type.
NEW: Download the test data and the corresponding baseline features for English-German, English-Czech, English-Latvian and German-English. For English-German and German-English, we also ask you to submit your results on the 2017 test data so we can attempt to measure progress over the years.

The usual 17 features used in WMT12-WMT17 are used for the baseline system. This system uses SVM regression with an RBF kernel, as well as a grid search for the optimisation of the relevant parameters. QuEst++ is used to build the baseline prediction model.
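As a rough sketch of this recipe (the official baseline is built with QuEst++), the following uses scikit-learn's SVR with an RBF kernel and a grid search over its hyper-parameters; the file names and grid values are illustrative, not the official configuration.

```python
# Minimal sketch of the sentence-level baseline: SVM regression (RBF kernel)
# over the 17 baseline features, with hyper-parameters tuned by grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X_train = np.loadtxt("train.baseline_features")  # one row of 17 features per sentence (assumed file)
y_train = np.loadtxt("train.hter")               # HTER label per sentence (assumed file)
X_test = np.loadtxt("test.baseline_features")

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1], "epsilon": [0.05, 0.1, 0.2]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)
predicted_hter = grid.predict(X_test)
```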

As in previous years, two variants of the results can be submitted:

Evaluation is performed against the true label and/or ranking using the following metrics:
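For reference, a minimal evaluation sketch is given below, assuming Pearson's r for the scoring variant (the primary metric, also used for Task 4) and Spearman's rho for the ranking variant; the official evaluation scripts compute the full set of metrics.

```python
# Hedged evaluation sketch: Pearson's r for scoring, Spearman's rho (assumed)
# for ranking, both against the gold labels.
from scipy.stats import pearsonr, spearmanr

def evaluate_scoring(predicted, gold):
    return pearsonr(predicted, gold)[0]

def evaluate_ranking(predicted, gold):
    return spearmanr(predicted, gold)[0]
```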



Task 2: Word-level QE

As in previous years, we frame the problem as the binary task of distinguishing between 'OK' and 'BAD' tokens. Participating systems are required to detect errors for each token in the MT output. In addition, in contrast to previous years, we also attempt to predict missing words in the translation for the first time. We require participants to label any sequence of one or more missing tokens with a single 'BAD' label, and also to indicate the 'BAD' tokens in the source sentence that are related to the tokens missing from the translation. This is particularly important for spotting adequacy errors in NMT.

The data for this task is exactly the same as that provided in Task 1. It was annotated with binary word-level labels using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, shifts disabled with the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation. Missing tokens in the machine translations, as indicated by the TER tool, are annotated as follows: a gap tag is placed at the start of the sentence and after each token. This tag is set to 'BAD' if one or more tokens should appear in that position, and to 'OK' otherwise. Note that the number of tags for each target sentence is 2*N+1, where N is the number of tokens in the sentence. All tokens in the source sentences are also labelled with either 'OK' or 'BAD'. For this, the alignments between source and post-edited sentences are used: if a token is labelled 'BAD' in the translation, all source tokens aligned to it are labelled 'BAD'. This is meant to indicate which source tokens lead to errors in the translation.
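The sketch below illustrates how the 2*N+1 target-side tags are laid out, assuming the word and gap labels have already been derived from the TER alignment as described above.

```python
# Interleaving word tags and gap tags into the 2*N+1 target-side tag sequence:
# a gap tag before the first token and after every token, alternating with word tags.

def interleave_tags(word_tags, gap_tags):
    """word_tags: N labels for the MT tokens; gap_tags: N+1 labels for the gaps.
    Returns the combined sequence of 2*N+1 labels: gap, word, gap, word, ..., gap."""
    assert len(gap_tags) == len(word_tags) + 1
    combined = [gap_tags[0]]
    for word_tag, gap_tag in zip(word_tags, gap_tags[1:]):
        combined.extend([word_tag, gap_tag])
    return combined

# Example with N = 3 tokens: 7 tags in total.
print(interleave_tags(["OK", "BAD", "OK"], ["OK", "OK", "BAD", "OK"]))
# ['OK', 'OK', 'OK', 'BAD', 'BAD', 'OK', 'OK']
```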

As training and development data, we provide the tokenised and truecased source and translation outputs with source and target tokens annotated with 'OK' or 'BAD' labels, as well as the source-target alignments, and gaps annotated for the translations. Download training and development data for all languages. Download German-English, English-German, English-Czech and English-Latvian baseline features.

As test data, for each language pair we will provide 1,000+ new sentence translations, produced in the same way.
NEW: Download the test data and the corresponding baseline features for English-German, English-Czech, English-Latvian and German-English. For English-German and German-English, we also ask you to submit your results on the 2017 test data so we can attempt to measure progress over the years.

The baseline system is similar to the baseline used at WMT15-WMT17: the set of baseline features includes the same features as the ones used last year, with the addition of feature combinations (target word + left/right context, target word + source word, etc.). The features are extracted with the Marmot QE tool. The system is trained with the CRFSuite toolkit using the passive-aggressive algorithm.
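As an illustration of this recipe (the official features are extracted with Marmot), the sketch below trains a CRF with python-crfsuite's passive-aggressive algorithm over toy lexical feature combinations; the feature set and training data here are placeholders, not the official baseline.

```python
# Hedged sketch of the word-level baseline idea: a CRF trained with the
# passive-aggressive algorithm over simple feature combinations
# (target word + left/right context, target word + aligned source word).
import pycrfsuite

def token_features(mt_tokens, src_aligned, i):
    return {
        "word": mt_tokens[i],
        "left": mt_tokens[i - 1] if i > 0 else "<s>",
        "right": mt_tokens[i + 1] if i < len(mt_tokens) - 1 else "</s>",
        "word+src": mt_tokens[i] + "|" + src_aligned[i],  # target word + source word combination
    }

# Toy training data: (MT tokens, aligned source tokens, gold tags) triples.
training_sentences = [
    (["das", "Haus", "ist", "klein"], ["the", "house", "is", "small"], ["OK", "OK", "OK", "BAD"]),
    (["ein", "Auto"], ["a", "car"], ["OK", "OK"]),
]

trainer = pycrfsuite.Trainer(algorithm="pa")  # passive-aggressive training in CRFSuite
for mt_tokens, src_aligned, labels in training_sentences:
    feats = [token_features(mt_tokens, src_aligned, i) for i in range(len(mt_tokens))]
    trainer.append(feats, labels)
trainer.train("wordlevel_baseline.crfsuite")
```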

Submissions are evaluated in terms of classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently:

We will also provide an overall F1 score that combines the three labels for systems submitting them all. We use this evaluation script for the metrics, and this script to compute significance levels using approximate randomisation.
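The sketch below shows the primary word-level metric, assuming standard per-class F1 scores: the product of the F1 of the 'OK' class and the F1 of the 'BAD' class. The official numbers are produced by the evaluation script linked above.

```python
# F1-multi: product of the per-class F1 scores for 'OK' and 'BAD'.
from sklearn.metrics import f1_score

def f1_multi(gold_tags, predicted_tags):
    f1_ok = f1_score(gold_tags, predicted_tags, pos_label="OK")
    f1_bad = f1_score(gold_tags, predicted_tags, pos_label="BAD")
    return f1_ok * f1_bad
```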

Task 3: Word/phrase-level QE with human annotation for phrases

This task uses a subset of the German-English SMT data from Task 1 where each phrase (as produced by the decoder) has been annotated (as a phrase) by humans with one of four labels: 'OK', 'BAD' (the phrase contains one or more errors), 'BAD_word_order' (the phrase is in an incorrect position in the sentence), and 'BAD_omission' (a word is missing before/after the phrase). We divided this task into two subtasks: word-level prediction (Task 3a) and phrase-level prediction (Task 3b). Download the German-English training data (5,921 instances) and development data (1,000 instances) for both variants, as well as the baseline features.

As test data, we will provide 543 new sentence translations, produced and annotated in the same way.
NEW: Download the test data and the corresponding baseline features.



Task 4: Document-level QE

This is a completely new task. It is based on data from the Amazon Product Reviews dataset: a selection of Sports and Outdoors product titles and descriptions in English that have been machine translated into French using a state-of-the-art online neural MT system. The most popular products (those with the most reviews) were chosen. This data poses interesting challenges for machine translation: titles and descriptions are often short and not always complete sentences. The data was annotated for errors at the word level using a fine-grained error taxonomy (MQM).

MQM is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (the translation affects the reading of the text) and style (the translation has stylistic problems, like the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides the identification of an error and its classification according to this typology (by applying a specific tag), each error receives a severity level that indicates its impact on the overall meaning, style and fluency of the translation. An error can be minor (if it does not lead to a loss of meaning and does not confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).

The word error annotations and their severity levels can be extrapolated to phrases, sentences and documents. For this task, we concentrate on the latter, where a document contains the product title and description for a given product. The document-level scores were generated from the word-level errors and their severities using the method in this paper (footnote 6). The dataset is the largest collection with manually annotated word-level errors released to date.

The training and development data contain 1,000 English-French training documents and 200 development documents, comprising a total of 7,304 segments with word-level error annotations. Download training and development sets.

The baseline system will be the same as that of the document-level task at WMT16, using QuEst++, except for the GIZA++-related features. Download a subset of 15 features for all test sets.

NEW: The test data contains 269 English-French documents with 1,652 segments. Download test set.

Submissions will be evaluated as in Task 1, in terms of Pearson's correlation between the true and predicted document-level scores.



Additional resources

These are the resources we used to extract the baseline features for Task 1, which can also be useful for Tasks 2 and 3:

English-German

German-English

English-Latvian

English-Czech



Submission Information

For CODALAB submissions, click:

Submission Format

Task 1

The output of your system for a given subtask should contain scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Each field should be delimited by a single tab character.
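For illustration, a minimal writer for this format is sketched below; the method name, output file name and the ranking convention (rank 1 for the segment predicted to need the least post-editing) are assumptions, not official requirements.

```python
# Hypothetical writer for the Task 1 segment-level format:
# <METHOD NAME> tab <SEGMENT NUMBER> tab <SEGMENT SCORE> tab <SEGMENT RANK>
def write_task1_submission(hter_scores, method="MY_METHOD", path="predictions.task1"):
    # Assumed convention: rank 1 for the lowest predicted HTER.
    order = sorted(range(len(hter_scores)), key=lambda i: hter_scores[i])
    rank_of = {seg: rank + 1 for rank, seg in enumerate(order)}
    with open(path, "w") as out:
        for seg, score in enumerate(hter_scores):
            out.write("\t".join([method, str(seg), f"{score:.4f}", str(rank_of[seg])]) + "\n")
```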

Tasks 2 and 3a

Since this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these label types, or all of them, independently. The output of your system for each type of label should contain labels at the word level, formatted in the following way:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE> 

Each field should be delimited by a single tab character.
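A minimal writer for this format is sketched below; the method name and the type identifier are placeholders, and zero-based segment/word indices are an assumption of the sketch.

```python
# Hypothetical writer for the word-level format:
# <METHOD NAME> tab <TYPE> tab <SEGMENT NUMBER> tab <WORD INDEX> tab <WORD> tab <BINARY SCORE>
def write_wordlevel_file(sentences, path, method="MY_METHOD", label_type="mt"):
    """sentences: list of (tokens, tags) pairs, one per segment, in test-set order."""
    with open(path, "w") as out:
        for seg_id, (tokens, tags) in enumerate(sentences):
            for idx, (token, tag) in enumerate(zip(tokens, tags)):
                out.write("\t".join([method, label_type, str(seg_id), str(idx), token, tag]) + "\n")
```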

Task 3b

The output of your system should contain predictions for the translations at the phrase level. Use up to three separate files, one for each type of label: MT phrases, MT gaps and source phrases, formatted in the following way:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <PHRASE INDEX> <PHRASE> <BINARY SCORE> 

Each field should be delimited by a single tab character.

Example of the phrase-level format:

PHRASE_BASELINE 4 0 Geben Sie im Eigenschafteninspektor ( BAD
PHRASE_BASELINE 4 1 " Fenster " > " Eigenschaften " OK
PHRASE_BASELINE 4 2 ) , und wählen Sie BAD
PHRASE_BASELINE 4 3 Statischer Text OK
PHRASE_BASELINE 4 4 oder OK
PHRASE_BASELINE 4 5 Dynamischer Text OK
PHRASE_BASELINE 4 6 . OK

The example shows the labelling for the sentence (double vertical lines show phrase borders):

Geben Sie im Eigenschafteninspektor ( || ' Fenster ' > ' Eigenschaften ' || ) , und wählen Sie || Statischer Text || oder || Dynamischer Text || .

performed by the PHRASE_BASELINE system.

Task 4

The output of your system should contain scores for the translations at the document level, formatted in the following way:

<METHOD NAME> <DOCUMENT NUMBER> <DOCUMENT SCORE> 

The predictions should be sorted by ascending DOCUMENT NUMBER, and each field should be delimited by a single tab character.

Example of the document-level format:

DOC_BASELINE 0 00.000
DOC_BASELINE 1 11.111
DOC_BASELINE 2 22.222

The example shows that the documents named "doc0000", "doc0001" and "doc0002" have predicted quality scores of 00.000, 11.111 and 22.222, respectively.

Submission Requirements

Each participating team can submit at most 2 systems for each language pair of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs). These should be sent via email to Lucia Specia (lspecia@gmail.com). Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1, 2, 3.

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM

For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.

You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to give an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data: February 15, 2018
Release of test data: May 15, 2018
QE metrics results submission deadline: NEW: June 22, 2018
Paper submission deadline: July 27, 2018
Notification of acceptance: August 18, 2018
Camera-ready deadline: August 31, 2018

Organisers


Frédéric Blain (University of Sheffield)
Ramon Fernandez (Unbabel)
Varvara Logacheva (Moscow Institute of Physics and Technology)
Andre Martins (Unbabel)
Lucia Specia (University of Sheffield)

Contact

For questions or comments, email lspecia@gmail.com.

Supported by the European Commission under the projects

** OFFICIAL RESULTS **

Results of Task 2, Task 3a/b, Task 4.


Task 1 -- Sentence-level

Scoring and ranking results are available for English-German (SMT), English-German (NMT), German-English, English-Latvian (SMT), English-Latvian (NMT) and English-Czech.



Task 2 -- Word-level

Results for words in MT, gaps in MT and words in the source are available for English-German (SMT), English-German (NMT), German-English, English-Latvian (SMT), English-Latvian (NMT) and English-Czech.



Task 3 -- Phrase-level

Results are available for Task 3a (word-level) and Task 3b (phrase-level), each covering predictions in MT, gaps in MT and predictions in the source.



Task 4 -- Document-level

Results are available for English-French.