This shared task will build on its previous three editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We once again consider word-level and sentence-level estimation. Moreover, this year introduces a new task: document-level estimation. The sentence- and word-level tasks will explore a much larger dataset than in previous years. In addition, the quality annotations for this dataset have been produced from post-editions by crowdsourced translators, instead of professional translators. Altogether, our tasks have the following goals:
Results here, gold-standard labels here
This task consists in scoring (and ranking) sentences according to the percentage of edits that need to be fixed (HTER). It is similar to task 1.2 in WMT14, with HTER used as the quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. The data is the same as that used for the WMT15 APE task. Translations are produced by a single online SMT system, which needs to be treated as a black box as we do not have access to the actual system. Each of the training and test translations was post-edited by a crowdsourced translator, and HTER labels were computed using TER (default settings: tokenised, case insensitive, exact matching only, but with scores capped to 1).
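Purely as an illustration of what the label represents (the official scores come from the TER tool, which additionally handles shift operations and the settings above), a rough HTER-like value for one segment could be approximated as a word-level edit distance over the post-edition length, capped at 1:

```python
# Rough illustration of HTER: word-level Levenshtein distance between the MT
# output and its post-edition, divided by the post-edition length, capped at 1.
# The official labels are computed with the TER tool, which also models shifts.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def approximate_hter(mt_output, post_edit):
    hyp = mt_output.lower().split()
    ref = post_edit.lower().split()
    if not ref:
        return 0.0
    return min(word_edit_distance(hyp, ref) / len(ref), 1.0)

print(approximate_hter("the house blue is small", "the blue house is small"))
```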
As training and development data, we provide English-Spanish datasets with 11,271 and 1,000 source sentences, their machine translations, their post-editions (translations) and HTER scores, respectively. Download development data (and baseline features). Download training data (and baseline features).
As test data, we provide a new set of 1,817 English-Spanish translations produced by the same SMT system used for the training data. Download test data (and baseline features).
The same 17 features used in WMT12-13-14 are used for the baseline system. This system uses SVM regression with an RBF kernel, as well as a grid search algorithm for the optimisation of relevant parameters. QuEst is used to build the prediction models and this script is used to evaluate the models. For significance tests, we use the bootstrap resampling method with this code.
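A minimal sketch of such a regression baseline with scikit-learn, using placeholder data in place of the released 17 baseline features and HTER scores (file parsing omitted; the exact parameter grid used for the official baseline may differ):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the 17 baseline features and HTER scores;
# parsing of the released feature files is omitted here.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 17))
y_train = rng.uniform(0.0, 1.0, size=500)
X_test = rng.normal(size=(100, 17))

# SVM regression with an RBF kernel; grid search over C, gamma and epsilon.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1], "epsilon": [0.1, 0.2]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)
predictions = np.clip(grid.predict(X_test), 0.0, 1.0)  # keep predicted HTER in [0, 1]
```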
As in previous years, two variants of the results can be submitted:
Evaluation is performed against the true label and/or HTER ranking using the same metrics as in previous years:
Results here, gold-standard labels here
The goal of this task is to evaluate the extent to which we can detect word-level errors in Machine Translation output by annotating translation errors at the sub-sentence level. Often, the overall quality of a translated segment is significantly lowered by specific errors in a small number of words or phrases. Various types of errors can be found in translations, but for this task we consider all error types together, creating a binary distinction between 'GOOD' and 'BAD' tokens.
The data for this task is the same as provided in Task 1, with English-Spanish machine translations produced by the same online SMT system. All segments have been automatically annotated for errors with binary word-level labels by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. The edit operations considered as errors are: replacements, insertions and deletions. Shifts (word order errors) were not annotated as such (but rather as deletions+insertions) to avoid introducing noise in the annotation.
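The released labels were produced with the TER tool as described above; purely to illustrate the idea, the sketch below derives binary labels from an MT output and its post-edition using Python's difflib (which, like the setting above, uses no shifts, but whose alignment will not match TER exactly):

```python
import difflib

def word_labels(mt_tokens, pe_tokens):
    """Label each MT token 'OK' if it aligns to an identical post-edited token,
    'BAD' otherwise (replacements and extra words on the MT side).
    Words present only in the post-edition have no MT token to label."""
    labels = ["BAD"] * len(mt_tokens)
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "OK"
    return labels

mt = "the house blue is small".split()
pe = "the blue house is small".split()
print(list(zip(mt, word_labels(mt, pe))))
```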
As training and development data, we provide the tokenized translation outputs with each token annotated with a good or bad label. Download development data (and baseline features). Download training data (and baseline features).

As test data, we provide tokens from an additional 1,817 English-Spanish sentences, produced in the same way. Download test data (and baseline features).
Submissions are evaluated in terms of classification performance (precision, recall, F1) against the original labels. The main evaluation metric is the average F1 for the "BAD" class. Evaluation script. We also provide an alternative evaluation script that takes as input labels in the exact same format as the labels distributed for the training and dev sets, i.e. one line per sentence, one tag per word, whitespace separated, with tags in the set {'OK', 'BAD'}. For significance tests, we use the approximate randomisation method with this code.
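For reference, a small sketch of the main metric (this is not the official evaluation script; the toy tag sequences below stand in for the flattened contents of the gold and predicted tag files):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy gold and predicted tag sequences (one flat list of per-word tags);
# in practice these are read from the tag files described above,
# one line per sentence, one whitespace-separated tag per word.
gold = "OK OK BAD OK BAD BAD OK".split()
pred = "OK BAD BAD OK OK BAD OK".split()

# Main metric: F1 for the 'BAD' class (precision and recall are also reported).
print("F1-BAD:       ", f1_score(gold, pred, pos_label="BAD"))
print("Precision-BAD:", precision_score(gold, pred, pos_label="BAD"))
print("Recall-BAD:   ", recall_score(gold, pred, pos_label="BAD"))
```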
As baseline system for this task we use the baseline features provided above to train a binary classifier using a standard logistic regression algorithm (available for example in the scikit-learn toolkit).
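A sketch of such a classifier with scikit-learn, using placeholder data in place of the per-token baseline features and labels (feature extraction and file parsing omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the per-token baseline features and
# 'OK'/'BAD' labels; in practice these come from the released feature files.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 25))
y_train = rng.choice(["OK", "BAD"], size=1000, p=[0.8, 0.2])
X_test = rng.normal(size=(200, 25))

# Standard logistic regression baseline; 'BAD' is the minority class,
# so class weighting can help recall on it.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
predicted_tags = clf.predict(X_test)
```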
Results here, gold-standard labels here
This task consists of predicting the quality of units larger than sentences. For practical reasons, in this first edition, we will use paragraphs, as opposed to entire documents. We consider as application a scenario where the reader needs to process the translation of an entire text, as opposed to individual sentences, and has no knowledge of the source language. The quality label is computed against references using METEOR (settings: exact match, not tokenised, case insensitive, capped to 1 - from the Asiya toolkit). Participants are encouraged to devise and explore document-wide features.
For the training of prediction models, we provide a new dataset consisting of source paragraphs and their machine translations (for English-German or German-English), all in the
As test data, we provide a new set of translations produced by the same SMT systems used for the training data:
Two variants of the results can be submitted:
For each language pair, evaluation is performed against the true METEOR label and/or ranking using the same metrics as in previous years for sentence-level:
QuEst's 17 baseline features for paragraph level are used for the baseline system. As for sentence level, the baseline system is trained using SVM regression with an RBF kernel, as well as a grid search algorithm for the optimisation of relevant parameters. We use the same evaluation script as for sentence level. For significance tests, we use the bootstrap resampling method with this code.
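Purely to illustrate the idea behind such a significance test (the linked code should be used for the actual comparison), a paired bootstrap over segment-level absolute errors might look as follows; the toy scores are placeholders:

```python
import numpy as np

def paired_bootstrap(gold, pred_a, pred_b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A has lower MAE than system B."""
    gold, pred_a, pred_b = map(np.asarray, (gold, pred_a, pred_b))
    err_a = np.abs(pred_a - gold)
    err_b = np.abs(pred_b - gold)
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_samples):
        idx = rng.integers(0, len(gold), size=len(gold))  # resample segments with replacement
        if err_a[idx].mean() < err_b[idx].mean():
            wins += 1
    return wins / n_samples

# Toy values; in practice these are the gold labels and two systems' predictions.
gold = [0.31, 0.55, 0.12, 0.80, 0.44]
sys_a = [0.30, 0.50, 0.20, 0.75, 0.40]
sys_b = [0.45, 0.60, 0.05, 0.60, 0.35]
print(paired_bootstrap(gold, sys_a, sys_b))
```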
We suggest the following interesting resources that can be used as additional data for training (notice the difference in language pairs and/or text domains and/or MT systems):
These are the resources we have used to extract the baseline features in Tasks 1 and 3:
English
Spanish
German
Giza tables
The output of your system for a given subtask should produce scores for the translations at the segment level of the relevant task (sentence or paragraph), formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring/ranking.
SEGMENT SCORE is the predicted (HTER/METEOR) score for the particular segment; assign all 0's to it if you are only submitting ranking results.
SEGMENT RANK is the ranking of the particular segment; assign all 0's to it if you are only submitting absolute scores.

The output of your system should produce scores for the translations at the word level, formatted in the following way:
<METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenized sentence, as given in the training/test sets (starting at 0).
WORD is the actual word.
BINARY SCORE is either 'GOOD' for no issue or 'BAD' for any issue.
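A small sketch that writes predictions in these two formats; the method name, prediction values, file names and the use of tabs as field separators are all placeholder assumptions:

```python
method = "EXAMPLE_SVM"  # placeholder method name

# Segment-level output: <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>.
# Only absolute scores are submitted here, so all ranks are set to 0; segment
# numbers should follow the line numbering of the translation file being scored.
segment_scores = [0.31, 0.55, 0.12]  # placeholder predicted HTER/METEOR scores
with open("segment_level.txt", "w", encoding="utf-8") as out:
    for i, score in enumerate(segment_scores):
        out.write(f"{method}\t{i}\t{score:.4f}\t0\n")

# Word-level output: <METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>.
word_tags = [("the", "GOOD"), ("house", "BAD"), ("blue", "BAD")]  # tokens of segment 0
with open("word_level.txt", "w", encoding="utf-8") as out:
    for j, (word, tag) in enumerate(word_tags):
        out.write(f"{method}\t0\t{j}\t{word}\t{tag}\n")
```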
Submission files should be named INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:
INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF.
TASK-NAME is one of the following: 1, 2, 3.
METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM.
For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.
You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). Submitting a paper is not required; if you choose not to, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
Release of training data | February 15, 2015 |
Release of test data | May 4, 2015 |
QE metrics results submission deadline | June 2, 2015 |
Paper submission deadline | June 28, 2015 |
Notification of acceptance | July 21, 2015 |
Camera-ready deadline | August 11, 2015 |
For questions or comments, email Lucia Specia lspecia@gmail.com.
Supported by the European Commission under the projects with grant numbers 317471 and 645452.