Shared Task: Exploiting Parallel Texts for Statistical Machine Translation
June 8 and 9, 2006, in conjunction with NAACL 2006 in New York City
The shared task of the workshop is to build a probabilistic phrase translation table for phrase-based statistical machine translation (SMT). Systems are evaluated by translation quality on an unseen test set.
We provide a parallel corpus as training data (with word alignment), a baseline statistical machine translation system, and additional resources. Participants may augment this system or use their own system.
The goals of staging this shared task are:
- get reference performance numbers in a large-scale translation task for European languages
- pose special challenges with word order (German-English) and translating from English into foreign languages
- offer interested parties a (relatively) smooth start with hands-on experience in state-of-the-art statistical machine translation
- create publicly available data for machine translation and machine translation evaluation
We hope that both beginners and established research groups will participate in this task.
We provide training data for three European language pairs, and a common framework (including a language model and a baseline system). The task is to improve methods to build a phrase translation table (e.g. by better word alignment, phrase extraction, phrase scoring), augment the system otherwise (e.g. by preprocessing), or build entirely new translation systems.
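To make "phrase scoring" concrete, here is a minimal sketch of one common scoring scheme: relative-frequency estimation of p(target | source) over extracted phrase pairs. This is an illustration only and is not claimed to be how the provided baseline system scores phrases; the function name `score_phrase_pairs` and the toy German-English pairs are hypothetical.

```python
from collections import Counter


def score_phrase_pairs(phrase_pairs):
    """Estimate p(target | source) by relative frequency.

    phrase_pairs: list of (source_phrase, target_phrase) tuples, one entry
    per extraction event (toy input format, assumed for this sketch).
    """
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    return {
        (src, tgt): count / source_counts[src]
        for (src, tgt), count in pair_counts.items()
    }


# Toy extracted phrase pairs (hypothetical data, not from the corpus).
extracted = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the building"),
    ("haus", "house"),
]
table = score_phrase_pairs(extracted)
```

Better word alignment or phrase extraction changes the list of extracted pairs; better phrase scoring changes how the counts are turned into probabilities.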
Each participant's system is used to translate a test set of unseen sentences in the source language. Translation quality is measured by the BLEU score, which measures overlap with a reference translation, and by manual evaluation. Participants agree to contribute about eight hours of work to the manual evaluation.
To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide
- a fixed training set
- a fixed language model
- a fixed baseline system
Optionally, you may use
- a provided word alignment
Most current methods to train phrase translation tables build on a word alignment (i.e., the mapping of each word in the source sentence to words in the target sentence). Since word alignment is by itself a difficult task, we provide word alignments. These word alignments are acquired by automatic methods, hence they contain errors. You may get better performance by coming up with your own word alignment.
We also strongly encourage your participation, if you use
- your own training corpus
- your own sentence alignment
- your own language model
- your own decoder
Your submission report should highlight in which ways your own methods and data differ from the standard task. We may break down submitted results in different tracks, based on what resources were used.
The provided data is taken from the Europarl corpus, which is freely available. Please click on the links below to download the data.
If you prepare training data from the Europarl corpus directly, please do not take data from Q4/2000 (October-December), since it is reserved for development and test data.
Note that the training data is not lowercased. This may be useful for tagging and parsing tools. However, the phrase translation tables and language model use lowercased text. Since the provided development test set and final test set are mixed-cased, they have to be lowercased before translating.
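The lowercasing step can be sketched as follows. This is a minimal illustration, not a tool shipped with the task: the function names are hypothetical, and the file encoding is an assumption that should be set to match the files you actually downloaded.

```python
def lowercase_lines(lines):
    """Return a lowercased copy of each line, leaving tokenization untouched."""
    return [line.lower() for line in lines]


def lowercase_file(src_path, dst_path, encoding="utf-8"):
    """Lowercase a mixed-case text file line by line.

    The default encoding is an assumption; pass the encoding of your data.
    """
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(line.lower())


sample = lowercase_lines(["Das Haus ist KLEIN .", "Resumption of the Session"])
```

Keeping a mixed-case copy of the data alongside the lowercased one preserves the original for tagging and parsing tools.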
To tune your system during development, we provide a development set of 2000 sentences.
This data is identical with the 2005 development test data.
Development Test Data
To test your system during development, we provide a development test set of 2000 sentences.
This data is identical with the 2005 test data.
To test your system, translate the following 3064 sentences and send the output by email to email@example.com
- English (to be translated to French, Spanish and German)
- French (to be translated to English)
- Spanish (to be translated to English)
- German (to be translated to English)
Evaluation will be done both automatically and by human judgement.
- Automatic Scoring: We will use the BLEU score; a reference implementation is multi-bleu.perl.
- Manual Scoring: We will collect judgments about adequacy and fluency from human annotators. If you participate in the evaluation, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool, which you can play with here (using the WPT'05 submissions).
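For readers new to BLEU, the sketch below shows the idea behind the automatic score: clipped n-gram precision for n = 1 to 4 against a single reference, combined by geometric mean and multiplied by a brevity penalty. It is an illustration under simplifying assumptions (whitespace tokenization, one reference per sentence) and is not a substitute for multi-bleu.perl; the function names are hypothetical, and scores from this sketch should not be reported as official results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each n-gram count at the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


exact = corpus_bleu(["the house is small"], ["the house is small"])
longer = corpus_bleu(["the house is small here"], ["the house is small"])
```

Because the counts are pooled over the whole test set before the precisions are computed, the corpus-level score is not the average of per-sentence scores.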
March 20: Test data released (available on this web site)
March 31: Results submissions (by email to firstname.lastname@example.org)
April 7: Short paper submissions (4 pages)
Philipp Koehn (University of Edinburgh)
Christof Monz (University of London)