Shared Task: Machine Translation for European Languages
March 30-31, in conjunction with EACL 2009 in Athens, Greece
The results of the shared task are summarized in the paper:
- Findings of the 2009 Workshop on Statistical Machine Translation. Chris Callison-Burch, Philipp Koehn, Christof Monz and Josh Schroeder.
The raw data is available for download here:
You can also download the raw data from WMT07 and WMT08.
The format of the CSV judgment data columns is as follows (see the parsing sketch after the list):
- Task, e.g. WMT09 Spanish-English News
- Type (RANK or EDIT or EDIT_ACCEPT)
- Item ID (sentence number)
- Annotator ID (numerical ID since anonymized)
- Annotator Type Count (the number of judgments of this type that the annotator completed so far -- used to discard initial judgments to produce Figures 4 and 5 in the overview paper.)
- Time spent on annotation (in seconds)
- System name, e.g., uedin
- For RANK there are ranks for up to 5 systems:
- SCORE_A: a rank from 1 (best) to 5 (worst). Ties were allowed.
- For EDIT there is information about a single system:
- SCORE_A: BAD, EDIT, or OK, indicating that the sentence was too bad to edit, that it was edited, or that it was good enough not to require editing
- EXTRA: this field contains the edited sentence when SCORE_A is EDIT, and is empty otherwise
- For EDIT_ACCEPT there is information about up to 5 systems:
- SCORE_A: 0 or 1; 1 means that the edited translation was judged to be a fully fluent and meaning-equivalent alternative to the reference sentence, and 0 means that it was not
- SCORE_B: this is the ID of the annotator whose edited sentence is being judged
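As a concrete illustration, here is a minimal Python sketch for reading such a judgment file. The file name wmt09-judgments.csv is an assumption, and because the fields after the common prefix differ by judgment type, they are kept raw rather than parsed.

    import csv
    from collections import namedtuple

    # A minimal parsing sketch, not an official reader: "wmt09-judgments.csv"
    # is an assumed file name, and the fields after the common prefix depend
    # on the judgment type, so they are left unparsed in `rest`.
    Judgment = namedtuple(
        "Judgment",
        ["task", "type", "item_id", "annotator_id", "type_count", "seconds", "rest"],
    )

    def read_judgments(path="wmt09-judgments.csv"):
        """Yield one Judgment per CSV row; type-specific fields stay in `rest`."""
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) < 6:
                    continue  # skip blank or malformed lines
                yield Judgment(
                    task=row[0],            # e.g. "WMT09 Spanish-English News"
                    type=row[1],            # RANK, EDIT, or EDIT_ACCEPT
                    item_id=int(row[2]),    # sentence number
                    annotator_id=row[3],    # anonymized numerical ID
                    type_count=int(row[4]), # judgments of this type completed so far
                    seconds=float(row[5]),  # time spent on the annotation
                    rest=row[6:],           # system name(s) plus SCORE_A/SCORE_B/EXTRA
                )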
System-level human and automatic rankings are stored in a tab-delimited file with the following columns (a loading sketch follows the list):
- Metric: This includes automatic scores from bleu, bleu_cased, bleu_ter, bleusp, bleusp4114, maxsim, meteor-0.6, meteor-0.7, meteor-ranking, nist, nist_cased, rte_absolute, rte_pairwise, sempos, ter, terp, ulc, wcd6p4er, wpF, and wpbleu, and the Rank human score.
- Language pair: e.g. fr-en
- Test set: always newstest2009, since we only had one test set this year
- System name
- Score
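The following sketch loads this file into a nested dictionary. The file name system-level-scores.tsv is an assumption, and the file is assumed to contain no header row; substitute the actual file name from the data download.

    import csv
    from collections import defaultdict

    # A small loading sketch; "system-level-scores.tsv" is an assumed file name.
    def load_system_scores(path="system-level-scores.tsv"):
        """Return scores[metric][language_pair][system] = score as a float."""
        scores = defaultdict(lambda: defaultdict(dict))
        with open(path, newline="", encoding="utf-8") as f:
            for metric, lang_pair, test_set, system, score in csv.reader(f, delimiter="\t"):
                # test_set is always "newstest2009" for this year's data
                scores[metric][lang_pair][system] = float(score)
        return scores

    # Example: rank the fr-en systems by their bleu score, best first.
    if __name__ == "__main__":
        scores = load_system_scores()
        for system, value in sorted(scores["bleu"]["fr-en"].items(),
                                    key=lambda kv: kv[1], reverse=True):
            print(f"{system}\t{value:.4f}")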
Sentence-level automatic metric scores are stored in a tab-delimited file with the following columns (see the sketch after the list):
- Metric: Includes scores from bleusp, bleusp4114, MAXSIM, meteor-0.6, meteor-0.7, meteor-ranking, Oracle, random1, random2, RTEAbsolute, RTEPairwise, TER, TERp_A, ULC, wcd6p4er, wpbleu, wpF
- Language pair: e.g. fr-en
- Test set: always newstest2009, since we only had one test set this year
- System name
- Segment number (indexed from 0)
- Score
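A matching sketch for the sentence-level file is below. Again, sentence-level-scores.tsv is an assumed file name and the file is assumed to have no header row; only the column order given above is relied upon.

    import csv

    # Minimal sketch: collect per-segment scores for one metric and language pair.
    # "sentence-level-scores.tsv" is an assumed file name.
    def segment_scores(path="sentence-level-scores.tsv", metric="TER", lang_pair="fr-en"):
        """Return {(system, segment_number): score} for one metric and language pair."""
        out = {}
        with open(path, newline="", encoding="utf-8") as f:
            for m, lp, _test_set, system, segment, score in csv.reader(f, delimiter="\t"):
                if m == metric and lp == lang_pair:
                    out[(system, int(segment))] = float(score)  # segments indexed from 0
        return out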
Supported by the EuroMatrix project, P6-IST-5-034291-STP, funded by the European Commission under Framework Programme 6.