Shared Task: Machine Translation for European Languages
March 30-31, in conjunction with EACL 2009 in Athens, Greece
The results of the shared task are summarized in the paper:
- Findings of the 2009 Workshop on Statistical Machine Translation. Chris Callison-Burch, Philipp Koehn, Christof Monz and Josh Schroeder.
The raw data is available for download here:
You can also download the raw data from WMT07 and WMT08.
The format of the CSV judgment data columns is as follows (see the parsing sketch after the list):
- Task, e.g. WMT09 Spanish-English News
- Type (RANK or EDIT or EDIT_ACCEPT)
- Item ID (sentence number)
- Annotator ID (numerical ID since anonymized)
- Annotator Type Count (the number of judgments of this type that the annotator completed so far -- used to discard initial judgments to produce Figures 4 and 5 in the overview paper.)
- Time spent on annotation (in seconds)
- System name, e.g., uedin
- For RANK there are ranks for up to 5 systems:
- SCORE_A: a rank from 1 (best) to 5 (worst). Ties were allowed.
- For EDIT there is information about a single system:
- SCORE_A: BAD, EDIT, or OK, indicating that the sentence was too bad to edit, that it was edited, or that it was good enough not to require editing
- EXTRA: this field contains the edited sentence when SCORE_A is EDIT, and is empty otherwise
- For EDIT_ACCEPT there is information about up to 5 systems:
- SCORE_A: 0 or 1; 1 means that the edited translation was judged to be a fully fluent and meaning-equivalent alternative to the reference sentence, and 0 means that it was not
- SCORE_B: this is the ID of the annotator whose edited sentence is being judged
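As a concrete illustration, here is a minimal Python sketch for reading such a judgment file. The file name wmt09-judgments.csv is an assumption, and because the fields after the common prefix differ by judgment type, they are kept raw rather than parsed.

    import csv
    from collections import namedtuple

    # A minimal parsing sketch, not an official reader: "wmt09-judgments.csv"
    # is an assumed file name, and the fields after the common prefix depend
    # on the judgment type, so they are left unparsed in `rest`.
    Judgment = namedtuple(
        "Judgment",
        ["task", "type", "item_id", "annotator_id", "type_count", "seconds", "rest"],
    )

    def read_judgments(path="wmt09-judgments.csv"):
        """Yield one Judgment per CSV row; type-specific fields stay in `rest`."""
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) < 6:
                    continue  # skip blank or malformed lines
                yield Judgment(
                    task=row[0],            # e.g. "WMT09 Spanish-English News"
                    type=row[1],            # RANK, EDIT, or EDIT_ACCEPT
                    item_id=int(row[2]),    # sentence number
                    annotator_id=row[3],    # anonymized numerical ID
                    type_count=int(row[4]), # judgments of this type completed so far
                    seconds=float(row[5]),  # time spent on the annotation
                    rest=row[6:],           # system name(s) plus SCORE_A/SCORE_B/EXTRA
                )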
System-level human and automatic rankings are stored in a tab-delimited file with the following columns (a loading sketch follows the list):
- Metric: This includes automatic scores from bleu, bleu_cased, bleu_ter, bleusp, bleusp4114, maxsim, meteor-0.6, meteor-0.7, meteor-ranking, nist, nist_cased, rte_absolute, rte_pairwise, sempos, ter, terp, ulc, wcd6p4er, wpF, and wpbleu, and the Rank human score.
- Language pair: e.g. fr-en
- Test set: always newstest2009, since we only had one test set this year
- System name
- Score
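The following sketch loads this file into a nested dictionary. The file name system-level-scores.tsv is an assumption, and the file is assumed to contain no header row; substitute the actual file name from the data download.

    import csv
    from collections import defaultdict

    # A small loading sketch; "system-level-scores.tsv" is an assumed file name.
    def load_system_scores(path="system-level-scores.tsv"):
        """Return scores[metric][language_pair][system] = score as a float."""
        scores = defaultdict(lambda: defaultdict(dict))
        with open(path, newline="", encoding="utf-8") as f:
            for metric, lang_pair, test_set, system, score in csv.reader(f, delimiter="\t"):
                # test_set is always "newstest2009" for this year's data
                scores[metric][lang_pair][system] = float(score)
        return scores

    # Example: rank the fr-en systems by their bleu score, best first.
    if __name__ == "__main__":
        scores = load_system_scores()
        for system, value in sorted(scores["bleu"]["fr-en"].items(),
                                    key=lambda kv: kv[1], reverse=True):
            print(f"{system}\t{value:.4f}")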
Sentence-level automatic metric scores are stored in a tab-delimited file with the following columns (see the sketch after the list):
- Metric: Includes scores from bleusp, bleusp4114, MAXSIM, meteor-0.6, meteor-0.7, meteor-ranking, Oracle, random1, random2, RTEAbsolute, RTEPairwise, TER, TERp_A, ULC, wcd6p4er, wpbleu, wpF
- Language pair: e.g. fr-en
- Test set: always newstest2009, since we only had one test set this year
- System name
- Segment number (indexed from 0)
- Score
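A matching sketch for the sentence-level file is below. Again, sentence-level-scores.tsv is an assumed file name and the file is assumed to have no header row; only the column order given above is relied upon.

    import csv

    # Minimal sketch: collect per-segment scores for one metric and language pair.
    # "sentence-level-scores.tsv" is an assumed file name.
    def segment_scores(path="sentence-level-scores.tsv", metric="TER", lang_pair="fr-en"):
        """Return {(system, segment_number): score} for one metric and language pair."""
        out = {}
        with open(path, newline="", encoding="utf-8") as f:
            for m, lp, _test_set, system, segment, score in csv.reader(f, delimiter="\t"):
                if m == metric and lp == lang_pair:
                    out[(system, int(segment))] = float(score)  # segments indexed from 0
        return out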
Supported by the EuroMatrix project, P6-IST-5-034291-STP, funded by the European Commission under Framework Programme 6.