Workshop Shared Task: System Combination Task

EACL 2009 FOURTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Automatic System Combination

March 30-31, in conjunction with EACL 2009 in Athens, Greece

The system combination task of the workshop will focus on processing all of the system translations produced in the translation task. You will be provided with the submissions of all entrants to this year's translation task, split into tuning and testing sets, as well as references for the tuning portion of the data. You will be asked to return your combination of translations of the test set.

Goals

The goals of the system combination task are:

To investigate the applicability of current combination techniques when translating into languages other than English
To generate up-to-date performance numbers for European languages in order to provide a basis of comparison in future research
To provide the opportunity for human assessment of system combination output
To compare automatic metric and human assessments of system combination output

We hope that both beginners and established research groups will participate in this task with both novel and established combination techniques. We welcome everything from simple translation output selection to advanced consensus decoding techniques. As with the shared translation task, participants agree to contribute about eight hours of work to the manual evaluation.

Changes This Year

Last year we conducted a pilot system combination task with several invited participants. Last year's entrants used the Europarl test set (test2008) as tuning data and the News test set (news-test2008) for evaluation. This year we only have one test set, from the news domain, for the translation task.

Last year's translation task submissions for news-test2008 will be made available as early-release training data. This data was generated by different systems than those that will be submitted this year, so it may not be useful for tuning system-specific feature weights. With one test set this year across all language pairs, we will split the system data we receive into two sets, with tuning being approximately 500 lines and testing approximately 2500 lines. This will provide the opportunity for system combination entrants to learn weights for this year's systems.

As with the shared translation task, we are only translating a set of news stories prepared for this evaluation, not Europarl proceedings. As in the previous year, the news stories are taken from major news outlets such as the BBC, Der Spiegel, Le Monde, etc. during the time period of September-October 2008. Evaluation of system combination entrants will be similar to that of translation entrants, with both human judgements and automatic metrics.

Task Description

System combination submissions will be accepted in all translation tasks for which we receive two or more entrants. We are evaluating in both directions on the following language pairs:

French-English
Spanish-English
German-English
Czech-English
Hungarian-English

Any of the training data provided for the translation task can be used to train language models or other elements needed for your system combination approach. We also allow unconstrained entries, taking special note of the following information from the translation task:

Constrained vs. Unconstrained: You may use any additional resources that you wish to (including training data, knowledge sources such as existing translation systems), but you should flag that your system uses additional data. We will distinguish system submissions that used the provided training data (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.

Your submission report should highlight which data you used for training and what unconstrained resources you used. We may break down submitted results in different tracks based on what resources were used. You may submit contrastive runs to demonstrate different techniques or variants of your system, but we cannot guarantee that contrastive systems will receive human evaluation scores.

Training Data

Any of the training data posted for the shared translation task may be used.

Development Data

To tune your system during development, we provide a development set of 2051 sentences. This set was used as last year's news test set. Since most statistical systems use a tuning set and a test set during system development, we also provide a version of the development set split up into a tuning set (news-dev2009a) and a test set (news-dev2009b), consisting of alternating sentences from the original set. The same split has been performed on all submissions to last year's task.
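The alternating split described above can be reproduced on any line-aligned file with a few lines of code. The following is only a sketch: the function name is hypothetical, and the assignment of the first sentence to the tuning half is an assumption, since the text states only that the two halves consist of alternating sentences.

```python
def alternating_split(lines):
    """Split a list of sentences into two halves by alternating position."""
    # Assumption: the 1st, 3rd, 5th, ... sentences go to the tuning half.
    tuning = lines[0::2]
    # Assumption: the 2nd, 4th, 6th, ... sentences go to the test half.
    test = lines[1::2]
    return tuning, test


# Toy example with placeholder sentences.
sentences = ["s1", "s2", "s3", "s4", "s5"]
tuning, test = alternating_split(sentences)
```

Applying the same function to the source, the reference, and every system output keeps all files line-aligned within each half.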

News news-dev2009

English
French
Spanish
German
Czech
Hungarian
This data is a cleaned version of the 2008 test set.

News news-dev2009 Submissions

English
French
Spanish
German
Czech
Hungarian
These are the system submissions from 2008.

While the 2008 data won't be useful for tuning submission-specific weights for the 2009 competition, we are providing these sets as a way to jump-start the training of combination systems.

Test Data

The test set is similar in nature to the news-dev2008 development set. It is taken from identical sources.

Once we have received and processed all submissions to the translation task, we will split the data into approximately 500 lines of tuning data and 2500 lines of test data, and provide sources and references for the tuning set. At that point you can use the tuning data to refine your weights if necessary and return your system combination entry.

Note that we only require 1-best submissions from translation task entrants, so your system combination method should not rely on n-best data. We will request that participants provide n-best output if they can, but there is no guarantee that this data will be available. If you or your group are also participating in the translation task, please send your n-best output to jschroe1@inf.ed.ac.uk.
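As an illustration of a method that needs nothing beyond 1-best output, the sketch below selects, for each sentence, the system translation that agrees most with the others. It is not a method prescribed by the task: the function names and the unigram overlap score are assumptions chosen for brevity, and whitespace tokenization stands in for real preprocessing.

```python
from collections import Counter


def overlap(a, b):
    """Dice-style unigram overlap between two whitespace-tokenized sentences."""
    ca, cb = Counter(a.split()), Counter(b.split())
    shared = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2.0 * shared / total if total else 0.0


def select_consensus(hypotheses):
    """Return the hypothesis with the highest average overlap with the others.

    Expects at least one hypothesis: one 1-best translation per system.
    """
    if len(hypotheses) == 1:
        return hypotheses[0]

    def score(i):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        return sum(overlap(hypotheses[i], h) for h in others) / len(others)

    best = max(range(len(hypotheses)), key=score)
    return hypotheses[best]


# Toy example: three system outputs for one source sentence.
hyps = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog stood there",
]
chosen = select_consensus(hyps)
```

In this toy example the second output is chosen because it shares words with both of the others. A tuned system would typically replace the overlap score with a stronger similarity measure and weight systems using the tuning set.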

News news-test2009

English
French
Spanish
German
Czech
Hungarian

Download

Last Year's Submissions - Release includes references and A/B split (41MB)
This Year's Submissions - Tuning and Test data - Release includes newssyscomb2009/newstest2009 split and references for newssyscomb2009 (90MB)

Evaluation

Evaluation will be done both automatically and by human judgement.

Manual Scoring: We will collect subjective judgments about translation quality from human annotators. If you participate in the task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.
We expect the translated submissions to be in recased, detokenized, XML format, just as in most other translation campaigns (NIST, TC-Star).

Dates

December 22: 2009 tuning and test data released (available on this web site)
January 5: Results submissions (by email to jschroe1@inf.ed.ac.uk)
January 9: Short paper submissions (4 pages)

supported by the EuroMatrix project, P6-IST-5-034291-STP
funded by the European Commission under Framework Programme 6
