This shared task will examine automatic methods for estimating machine translation output quality at run-time. Quality estimation aims at providing a quality indicator for unseen translated sentences without relying on reference translations. In this second edition of the shared task, we will consider both word-level and sentence-level estimation.
Some interesting uses of sentence-level quality estimation are the following:
The main goals of the shared quality estimation task are:
This task is similar to the one in WMT12, but with one important difference in the scoring variant: based on feedback received last year, instead of using the [1-5] scores for post-editing effort, we will use HTER as our quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. Two variants of the results can be submitted:
NOTE: Participants are free to use other post-edited material as additional training data ("open" submission). However, for submitting to Task 1.1, we require at least one submission per participant using only the official training set of 2,254 sentences ("restricted" submission).
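For intuition only: HTER is normally computed with the TER tool, which also counts block shifts among the edit operations. The hypothetical simple_hter function below is a simplified word-level approximation of the [0,1] score defined above, not the official scorer.

def simple_hter(mt: str, post_edit: str) -> float:
    """Approximate HTER: word-level edit distance between the MT output and its
    post-edit, divided by the post-edit length (no block shifts, unlike real TER)."""
    hyp, ref = mt.split(), post_edit.split()
    # Standard dynamic-programming Levenshtein distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)

print(simple_hter("the house blue", "the blue house"))  # 0.67 here; real TER would count one shift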
As test data, we provide a new set of translations produced by the same MT system as the one used for the training data. Evaluation will be performed against the HTER scores and/or rankings of those translations using the same metrics as in WMT12: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Spearman's rank correlation, and DeltaAvg.
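The official evaluation uses the organisers' scripts; as a rough illustration, MAE, RMSE and Spearman's rank correlation can be computed as below (DeltaAvg is omitted here, and the scores are made up).

import math
from scipy.stats import spearmanr

def mae(pred, gold):
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

# Toy predicted and gold HTER scores for four segments.
pred = [0.10, 0.35, 0.60, 0.20]
gold = [0.15, 0.30, 0.70, 0.25]
print(mae(pred, gold), rmse(pred, gold), spearmanr(pred, gold).correlation)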
Participants will be required to rank up to five alternative translations for the same source sentence produced by multiple MT systems. We will use essentially the same data provided to participants of WMT's evaluation metrics task -- where MT evaluation metrics are assessed according to how well they correlate with human rankings. However, reference translations will not be allowed in this task. We provide:
Participating systems will be required to produce for each sentence:
For training we provide a new dataset: English-Spanish news sentences produced by a phrase-based SMT system (Moses), along with their source sentences, post-edited translations, and the time (in seconds) spent post-editing each segment. The data was collected using five translators (with few overlapping annotations). For each segment we provide an ID specifying the translator who post-edited it (for those interested in training translator-specific models).
As test data, we provide additional source sentences and translations produced with the same SMT system, and IDs of the translators who will post-edit each of these translations (same post-editors as in the training data).
Submissions will be evaluated in terms of Mean Absolute Error (MAE) against the time spent by the same translators post-editing these sentences.
For Tasks 1.1-1.3, we also provide a system and resources to extract QE features (language model, GIZA++ tables, etc.), where these are available. We also provide the machine learning algorithm that will be used as the baseline: SVM regression with an RBF kernel, as well as a grid search algorithm for the optimisation of the relevant parameters. The same 17 features used in WMT12 will be considered for the baseline systems.
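A minimal sketch of such a baseline, using scikit-learn rather than the official baseline scripts, and assuming the 17 baseline features have already been extracted into numeric matrices (random toy data stands in for them here):

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# X_train: one row of 17 baseline features per training segment; y_train: HTER scores.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(200, 17), rng.rand(200)
X_test = rng.rand(50, 17)

# SVM regression with an RBF kernel; grid search over typical hyper-parameters.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1], "epsilon": [0.1, 0.2]},
    cv=5,
)
grid.fit(X_train, y_train)
pred_hter = grid.predict(X_test)  # predicted quality scores for the test segments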
As test data, we provide a tokenized version of the test data used in Task 1.3.
Submissions will be evaluated in terms of classification performance (precision, recall, F1) against the original labels in the two variants (binary and multi-class).
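As a rough illustration (the official scoring scripts may differ in details such as averaging), per-class precision, recall and F1 over word-level labels could be computed along these lines, using made-up gold and predicted labels:

from sklearn.metrics import precision_recall_fscore_support

gold = ["K", "C", "K", "K", "C"]  # toy binary word-level labels
pred = ["K", "K", "K", "C", "C"]
p, r, f1, _ = precision_recall_fscore_support(gold, pred, labels=["K", "C"])
print(dict(zip(["K", "C"], zip(p, r, f1))))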
Your system should produce scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring/ranking.
SEGMENT SCORE is the predicted HTER score for the particular segment - assign all 0's to it if you are only submitting ranking results.
SEGMENT RANK is the ranking of the particular segment - assign all 0's to it if you are only submitting scores.
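For illustration only, a hypothetical scoring-only submission (method name "SVM", whitespace-separated fields, ranks set to 0, made-up scores) could be written from Python as follows:

predictions = [0.152, 0.487, 0.034]          # made-up HTER predictions for segments 1-3
with open("SHEF_1-1_SVM", "w") as out:       # file name follows the naming convention below
    for seg_num, score in enumerate(predictions, start=1):
        # <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>
        out.write("SVM %d %.4f %d\n" % (seg_num, score, 0))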
The format of the output file should be the same as that of the test files provided, with the difference that the empty field "rank=" needs to be completed with a number from 1 to 5 indicating the ranking of the sentence (ties are allowed).
Your system should produce scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT TIME>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring.
SEGMENT TIME is the predicted time, in seconds, for the particular segment.

Your system should produce scores for the translations at the word level, formatted in the following way:
<METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <BINARY SCORE> <MULTI SCORE>

Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring.
WORD INDEX is the index of the word in the tokenized sentence, as given in the training/test sets.
BINARY SCORE is either 'K' (Keep) or 'C' (Change) - assign all 0's to it if you are only submitting multi-class scores.
MULTI SCORE is the multi-class score: 'K' (Keep), 'S' (Substitute), or 'D' (Delete) - assign all 0's to it if you are only submitting binary class scores.
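For illustration only (made-up method name and labels, assuming whitespace-separated fields), the first few words of the first test segment could be submitted as:

SVM 1 1 K K
SVM 1 2 K K
SVM 1 3 C S
SVM 1 4 C D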
Submission files should be named INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:
INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF
TASK-NAME is one of the following: 1-1, 1-2, 1-3, 2
METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM
For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.
Release of training sets + baseline systems | March 6, 2013
Release of test sets | May 17, 2013
Release of updated data sets for tasks 1.3 and 2 | May 30, 2013
Submission deadline for all QE subtasks | June 5, 2013
Paper submission deadline | June 10, 2013
You are invited to submit a short paper (4 to 6 pages) describing your QE method(s). Submitting a paper is not required; if you choose not to submit one, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
We encourage individuals who are submitting research papers to also submit entries to the shared task using the training resources provided by this workshop (in addition to any entries that may use other training resources), so that their experiments can be repeated by others using these publicly available resources.
For questions, comments, etc. please send email to Lucia Specia lspecia@gmail.com.