This shared task will build on its previous three editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We once again consider word-level and sentence-level estimation. Moreover, this year introduces a new task: document-level estimation. The sentence- and word-level tasks will explore a much larger dataset than in previous years. In addition, the quality annotations for this dataset have been produced from post-editions by crowdsourced translators, instead of professional translators. Altogether, our tasks have the following goals:
Results here, gold-standard labels here
This task consists in scoring (and ranking) sentences according to the percentage of edits that need to be fixed (HTER). It is similar to task 1.2 in WMT14, with HTER used as the quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. The data is the same as that used for the WMT15 APE task. Translations are produced by a single online SMT system, which needs to be treated as a black box as we do not have access to the actual system. Each of the training and test translations was post-edited by a crowdsourced translator, and HTER labels were computed using TER (default settings: tokenised, case insensitive, exact matching only, but with scores capped to 1).
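Purely as an illustration of what the label represents (the official scores come from the TER tool, which additionally handles shift operations and the settings above), a rough HTER-like value for one segment could be approximated as a word-level edit distance over the post-edition length, capped at 1:

```python
# Rough illustration of HTER: word-level Levenshtein distance between the MT
# output and its post-edition, divided by the post-edition length, capped at 1.
# The official labels are computed with the TER tool, which also models shifts.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def approximate_hter(mt_output, post_edit):
    hyp = mt_output.lower().split()
    ref = post_edit.lower().split()
    if not ref:
        return 0.0
    return min(word_edit_distance(hyp, ref) / len(ref), 1.0)

print(approximate_hter("the house blue is small", "the blue house is small"))
```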
As training and development data, we provide English-Spanish datasets with 11,271 and 1,000 source sentences, their machine translations, their post-editions (translations) and HTER scores, respectively. Download development data (and baseline features). Download training data (and baseline features).
As test data, we provide a new set of 1,817 English-Spanish translations produced by the same SMT system used for the training data. Download test data (and baseline features).
The same 17 features used in WMT12-13-14 are used for the baseline system. This system uses SVM regression with an RBF kernel, as well as a grid search algorithm for the optimisation of relevant parameters. QuEst is used to build the prediction models and this script is used to evaluate the models. For significance tests, we use the bootstrap resampling method with this code.
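A minimal sketch of such a regression baseline with scikit-learn, using placeholder data in place of the released 17 baseline features and HTER scores (file parsing omitted; the exact parameter grid used for the official baseline may differ):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the 17 baseline features and HTER scores;
# parsing of the released feature files is omitted here.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 17))
y_train = rng.uniform(0.0, 1.0, size=500)
X_test = rng.normal(size=(100, 17))

# SVM regression with an RBF kernel; grid search over C, gamma and epsilon.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1], "epsilon": [0.1, 0.2]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)
predictions = np.clip(grid.predict(X_test), 0.0, 1.0)  # keep predicted HTER in [0, 1]
```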
As in previous years, two variants of the results can be submitted:
Evaluation is performed against the true label and/or HTER ranking using the same metrics as in previous years:
Results here, gold-standard labels here
The goal of this task is to evaluate the extent to which we can detect word-level errors in Machine Translation output by annotating translation errors at the sub-sentence level. Often, the overall quality of a translated segment is significantly lowered by specific errors in a small number of words or phrases. Various types of errors can be found in translations, but for this task we consider all error types together, creating a binary distinction between 'GOOD' and 'BAD' tokens.
The data for this task is the same as provided in Task 1, with English-Spanish machine translations produced by the same online SMT system. All segments have been automatically annotated for errors with binary word-level labels by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. The edit operations considered as errors are: replacements, insertions and deletions. Shifts (word order errors) were not annotated as such (but rather as deletions+insertions) to avoid introducing noise in the annotation.
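The released labels were produced with the TER tool as described above; purely to illustrate the idea, the sketch below derives binary labels from an MT output and its post-edition using Python's difflib (which, like the setting above, uses no shifts, but whose alignment will not match TER exactly):

```python
import difflib

def word_labels(mt_tokens, pe_tokens):
    """Label each MT token 'OK' if it aligns to an identical post-edited token,
    'BAD' otherwise (replacements and extra words on the MT side).
    Words present only in the post-edition have no MT token to label."""
    labels = ["BAD"] * len(mt_tokens)
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "OK"
    return labels

mt = "the house blue is small".split()
pe = "the blue house is small".split()
print(list(zip(mt, word_labels(mt, pe))))
```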
As training and development data, we provide the tokenized translation outputs with each token annotated with a good or bad label. Download development data (and baseline features). Download training data (and baseline features).

As test data, we provide tokens from an additional 1,817 English-Spanish sentences, produced in the same way. Download test data (and baseline features).
Submissions are evaluated in terms of classification performance (precision, recall, F1) against the original labels. The main evaluation metric is the average F1 for the "BAD" class. Evaluation script. We also provide an alternative evaluation script that takes as input labels in the exact same format as the labels distributed for the training and dev sets, i.e. one line per sentence, one tag per word, whitespace separated, with tags in the set {'OK', 'BAD'}. For significance tests, we use the approximate randomisation method with this code.
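For reference, a small sketch of the main metric (this is not the official evaluation script; the toy tag sequences below stand in for the flattened contents of the gold and predicted tag files):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy gold and predicted tag sequences (one flat list of per-word tags);
# in practice these are read from the tag files described above,
# one line per sentence, one whitespace-separated tag per word.
gold = "OK OK BAD OK BAD BAD OK".split()
pred = "OK BAD BAD OK OK BAD OK".split()

# Main metric: F1 for the 'BAD' class (precision and recall are also reported).
print("F1-BAD:       ", f1_score(gold, pred, pos_label="BAD"))
print("Precision-BAD:", precision_score(gold, pred, pos_label="BAD"))
print("Recall-BAD:   ", recall_score(gold, pred, pos_label="BAD"))
```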
As baseline system for this task we use the baseline features provided above to train a binary classifier using a standard logistic regression algorithm (available for example in the scikit-learn toolkit).
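A sketch of such a classifier with scikit-learn, using placeholder data in place of the per-token baseline features and labels (feature extraction and file parsing omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the per-token baseline features and
# 'OK'/'BAD' labels; in practice these come from the released feature files.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 25))
y_train = rng.choice(["OK", "BAD"], size=1000, p=[0.8, 0.2])
X_test = rng.normal(size=(200, 25))

# Standard logistic regression baseline; 'BAD' is the minority class,
# so class weighting can help recall on it.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
predicted_tags = clf.predict(X_test)
```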
Results here, gold-standard labels here
This task consists of predicting the quality of units larger than sentences. For practical reasons, in this first edition, we will use paragraphs, as opposed to entire documents. We consider as application a scenario where the reader needs to process the translation of an entire text, as opposed to individual sentences, and has no knowledge of the source language. The quality label is computed against references using METEOR (settings: exact match, not tokenised, case insensitive, capped to 1 - from the Asiya toolkit). Participants are encouraged to devise and explore document-wide features.
For the training of prediction models, we provide a new dataset consisting of source paragraphs and their machine translations (for English-German or German-English), all in the
As test data, we provide a new set of translations produced by the same SMT systems used for the training data:
Two variants of the results can be submitted:
For each language pair, evaluation is performed against the true METEOR label and/or ranking using the same metrics as in previous years for sentence-level:
QuEst's 17 baseline features for paragraph level are used for the baseline system. As for sentence level, the baseline system is trained using SVM regression with an RBF kernel, as well as a grid search algorithm for the optimisation of relevant parameters. We use the same evaluation script as for sentence level. For significance tests, we use the bootstrap resampling method with this code.
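Purely to illustrate the idea behind such a significance test (the linked code should be used for the actual comparison), a paired bootstrap over segment-level absolute errors might look as follows; the toy scores are placeholders:

```python
import numpy as np

def paired_bootstrap(gold, pred_a, pred_b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A has lower MAE than system B."""
    gold, pred_a, pred_b = map(np.asarray, (gold, pred_a, pred_b))
    err_a = np.abs(pred_a - gold)
    err_b = np.abs(pred_b - gold)
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_samples):
        idx = rng.integers(0, len(gold), size=len(gold))  # resample segments with replacement
        if err_a[idx].mean() < err_b[idx].mean():
            wins += 1
    return wins / n_samples

# Toy values; in practice these are the gold labels and two systems' predictions.
gold = [0.31, 0.55, 0.12, 0.80, 0.44]
sys_a = [0.30, 0.50, 0.20, 0.75, 0.40]
sys_b = [0.45, 0.60, 0.05, 0.60, 0.35]
print(paired_bootstrap(gold, sys_a, sys_b))
```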
We suggest the following interesting resources that can be used as additional data for training (notice the difference in language pairs and/or text domains and/or MT systems):
These are the resources we have used to extract the baseline features in Tasks 1 and 3:
English
Spanish
German
Giza tables
The output of your system for a given subtask should produce scores for the translations at the segment level of the relevant task (sentence or paragraph), formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring/ranking.
SEGMENT SCORE is the predicted (HTER/METEOR) score for the particular segment; assign all 0's to it if you are only submitting ranking results.
SEGMENT RANK is the ranking of the particular segment; assign all 0's to it if you are only submitting absolute scores.

The output of your system should produce scores for the translations at the word level, formatted in the following way:
<METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenized sentence, as given in the training/test sets (starting at 0).
WORD is the actual word.
BINARY SCORE is either 'GOOD' for no issue or 'BAD' for any issue.
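A small sketch that writes predictions in these two formats; the method name, prediction values, file names and the use of tabs as field separators are all placeholder assumptions:

```python
method = "EXAMPLE_SVM"  # placeholder method name

# Segment-level output: <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>.
# Only absolute scores are submitted here, so all ranks are set to 0; segment
# numbers should follow the line numbering of the translation file being scored.
segment_scores = [0.31, 0.55, 0.12]  # placeholder predicted HTER/METEOR scores
with open("segment_level.txt", "w", encoding="utf-8") as out:
    for i, score in enumerate(segment_scores):
        out.write(f"{method}\t{i}\t{score:.4f}\t0\n")

# Word-level output: <METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>.
word_tags = [("the", "GOOD"), ("house", "BAD"), ("blue", "BAD")]  # tokens of segment 0
with open("word_level.txt", "w", encoding="utf-8") as out:
    for j, (word, tag) in enumerate(word_tags):
        out.write(f"{method}\t0\t{j}\t{word}\t{tag}\n")
```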
Submission files should be named INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:
INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF.
TASK-NAME is one of the following: 1, 2, 3.
METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM.
For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.
You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). Submitting a paper is not required; if you choose not to, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
Release of training data | February 15, 2015 |
Release of test data | May 4, 2015 |
QE metrics results submission deadline | June 2, 2015 |
Paper submission deadline | June 28, 2015 |
Notification of acceptance | July 21, 2015 |
Camera-ready deadline | August 11, 2015 |
For questions or comments, email Lucia Specia lspecia@gmail.com.
Supported by the European Commission under the projects with grant numbers 317471 and 645452.