This shared task will examine automatic methods for estimating machine translation output quality at run-time. Quality estimation aims at providing a quality indicator for unseen translated sentences without relying on reference translations. In this second edition of the shared task, we will consider both word-level and sentence-level estimation.
Some interesting uses of sentence-level quality estimation are the following:
The main goals of the shared quality estimation task are:
This task is similar to the one in WMT12, but with one important difference in the scoring variant: based on feedback received last year, instead of using the [1-5] scores for post-editing effort, we will use HTER as our quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. Two variants of the results can be submitted:
NOTE: Participants are free to use other post-edited material as additional training data ("open" submission). However, for submitting to Task 1.1, we require at least one submission per participant using only the official training set of 2,254 sentences ("restricted" submission).
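For intuition only: HTER is normally computed with the TER tool, which also counts block shifts among the edit operations. The hypothetical simple_hter function below is a simplified word-level approximation of the [0,1] score defined above, not the official scorer.

def simple_hter(mt: str, post_edit: str) -> float:
    """Approximate HTER: word-level edit distance between the MT output and its
    post-edit, divided by the post-edit length (no block shifts, unlike real TER)."""
    hyp, ref = mt.split(), post_edit.split()
    # Standard dynamic-programming Levenshtein distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)

print(simple_hter("the house blue", "the blue house"))  # 0.67 here; real TER would count one shift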
As test data, we provide a new set of translations produced by the same MT system as the one used for the training data. Evaluation will be performed against the HTER scores and/or rankings of those translations using the same metrics as in WMT12: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Spearman's rank correlation, and DeltaAvg.
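The official evaluation uses the organisers' scripts; as a rough illustration, MAE, RMSE and Spearman's rank correlation can be computed as below (DeltaAvg is omitted here, and the scores are made up).

import math
from scipy.stats import spearmanr

def mae(pred, gold):
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

# Toy predicted and gold HTER scores for four segments.
pred = [0.10, 0.35, 0.60, 0.20]
gold = [0.15, 0.30, 0.70, 0.25]
print(mae(pred, gold), rmse(pred, gold), spearmanr(pred, gold).correlation)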
Participants will be required to rank up to five alternative translations for the same source sentence produced by multiple MT systems. We will use essentially the same data provided to participants of WMT's evaluation metrics task -- where MT evaluation metrics are assessed according to how well they correlate with human rankings. However, reference translations will not be allowed in this task. We provide:
Participating systems will be required to produce for each sentence:
For training we provide a new dataset: English-Spanish news sentences produced by a phrase-based SMT system (Moses), along with their source sentences, post-edited translations, and the time (in seconds) spent post-editing each segment. The data was collected using five translators (with few overlapping annotations). For each segment we provide an ID specifying the translator who post-edited it (for those interested in training translator-specific models).
As test data, we provide additional source sentences and translations produced with the same SMT system, and IDs of the translators who will post-edit each of these translations (same post-editors as in the training data).
Submissions will be evaluated in terms of Mean Absolute Error (MAE) against the time spent by the same translators post-editing these sentences.
For Tasks 1.1-1.3, we also provide a system and resources to extract QE features (language model, GIZA++ tables, etc.), where these are available. We also provide the machine learning algorithm that will be used as the baseline: SVM regression with an RBF kernel, as well as a grid search algorithm for the optimisation of the relevant parameters. The same 17 features used in WMT12 will be considered for the baseline systems.
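A minimal sketch of such a baseline, using scikit-learn rather than the official baseline scripts, and assuming the 17 baseline features have already been extracted into numeric matrices (random toy data stands in for them here):

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# X_train: one row of 17 baseline features per training segment; y_train: HTER scores.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(200, 17), rng.rand(200)
X_test = rng.rand(50, 17)

# SVM regression with an RBF kernel; grid search over typical hyper-parameters.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1], "epsilon": [0.1, 0.2]},
    cv=5,
)
grid.fit(X_train, y_train)
pred_hter = grid.predict(X_test)  # predicted quality scores for the test segments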
As test data, we provide a tokenized version of the test data used in Task 1.3.
Submissions will be evaluated in terms of classification performance (precision, recall, F1) against the original labels in the two variants (binary and multi-class).
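As a rough illustration (the official scoring scripts may differ in details such as averaging), per-class precision, recall and F1 over word-level labels could be computed along these lines, using made-up gold and predicted labels:

from sklearn.metrics import precision_recall_fscore_support

gold = ["K", "C", "K", "K", "C"]  # toy binary word-level labels
pred = ["K", "K", "K", "C", "C"]
p, r, f1, _ = precision_recall_fscore_support(gold, pred, labels=["K", "C"])
print(dict(zip(["K", "C"], zip(p, r, f1))))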
Your system should produce scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring/ranking.
SEGMENT SCORE is the predicted HTER score for the particular segment - assign all 0's to it if you are only submitting ranking results.
SEGMENT RANK is the ranking of the particular segment - assign all 0's to it if you are only submitting scores.
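For illustration only, a hypothetical scoring-only submission (method name "SVM", whitespace-separated fields, ranks set to 0, made-up scores) could be written from Python as follows:

predictions = [0.152, 0.487, 0.034]          # made-up HTER predictions for segments 1-3
with open("SHEF_1-1_SVM", "w") as out:       # file name follows the naming convention below
    for seg_num, score in enumerate(predictions, start=1):
        # <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>
        out.write("SVM %d %.4f %d\n" % (seg_num, score, 0))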
The format of the output file should be the same as that of the test files provided, with the difference that the empty field "rank=" needs to be completed with a number from 1 to 5 indicating the ranking of the sentence (ties are allowed).
Your system should produce scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT TIME>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring.
SEGMENT TIME is the predicted time, in seconds, for the particular segment.

Your system should produce scores for the translations at the word level, formatted in the following way:
<METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <BINARY SCORE> <MULTI SCORE>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring.
WORD INDEX is the index of the word in the tokenized sentence, as given in the training/test sets.
BINARY SCORE is either 'K' (Keep) or 'C' (Change) - assign all 0's to it if you are only submitting multi-class scores.
MULTI SCORE is the multi-class score: 'K' (Keep), 'S' (Substitute), or 'D' (Delete) - assign all 0's to it if you are only submitting binary class scores.
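For illustration only (made-up method name and labels, assuming whitespace-separated fields), the first few words of the first test segment could be submitted as:

SVM 1 1 K K
SVM 1 2 K K
SVM 1 3 C S
SVM 1 4 C D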
Submission files should be named INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:
INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF
TASK-NAME is one of the following: 1-1, 1-2, 1-3, 2
METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM
For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.
Release of training sets + baseline systems | March 6, 2013
Release of test sets | May 17, 2013
Release of updated data sets for tasks 1.3 and 2 | May 30, 2013
Submission deadline for all QE subtasks | June 5, 2013
Paper submission deadline | June 10, 2013
You are invited to submit a short paper (4 to 6 pages) describing your QE method(s). Submitting a paper is not required; if you choose not to submit one, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
We encourage individuals who are submitting research papers to also submit entries to the shared task using the training resources provided by this workshop (in addition to any entries that may use other training resources), so that their experiments can be repeated by others using these publicly available resources.
For questions, comments, etc. please send email to Lucia Specia lspecia@gmail.com.