Workshop Shared Task: Automatic Evaluation of Translation Quality

EACL 2009 FOURTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Machine Translation for European Languages

March 30-31, in conjunction with EACL 2009 in Athens, Greece

The shared evaluation task of the workshop will examine automatic evaluation metrics for machine translation. We will provide all of the translations produced in the translation task and the system combination task, as well as the reference human translations. You will return rankings for each of the translations at the system-level and/or at the sentence-level. We will calculate the correlation of your rankings with the human judgements once the manual evaluation has been completed.

Goals

The goals of the shared evaluation task are:

To achieve the strongest correlation with human judgments of translation quality
To illustrate the suitability of an automatic evaluation metric as a surrogate for human evaluations
To address the problems associated with comparing against a single reference translation
To move automatic evaluation beyond system-level ranking to finer-grained sentence-level ranking

Task Description

We will provide the output of machine translation systems for five different language pairs (French-English, Spanish-English, German-English, Czech-English, Hungarian-English), and will give you the reference translations in each of those languages. You will provide scores for each of the outputs at the system-level and the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and cannot work with translations into languages other than English, you are free to assign scores only for translations into English.

We will measure the goodness of automatic evaluation metrics in the following ways:

System-level correlation: We will use Spearman's rank correlation coefficient (rho) to measure the correlation of the automatic metrics with the human judgments of translation quality at the system-level. The human ranking of the systems will be based on the manual evaluation. A system's rank will be assigned based on the percent of time that the sentences it produced were judged to be better than or equal to the translations of any other system. Since automatic metrics generally assign a score rather than a rank, we will convert their raw scores into ranks prior to calculating rho.
Consistency at the sentence-level: Rather than calculating a correlation co-efﬁcient at the sentence-level we will instead measure how consistent the automatic metrics are with the human judgments. Consistency will be calculated as follows: for every pairwise comparison of two systems output for a single sentence, we will counted the automatic metric as being consistent if its relative scores are the same as the human judgment for that sentence (i.e. the metric assigned a higher score to the higher ranked system). Because the automatic metrics generally assign real numbers as scores, we excluded pairs that the human annotators ranked as ties.

Submission Format

Once we receive the system outputs from the translation task and the system combination task we will post all of the system outputs for you to score with your metric. The translations will be distributed as plain text files with one translation per line. (We can also provide the output in the NIST MT Evaluation Workshop's XML format. Please contact us if this format is easier for you.)

The output of your software should produce scores for the translations either at the system-level or the segment-level (or preferably both).

Output file format for system-level rankings

The output files for system-level rankings should be formatted in the following way:

<METRIC NAME>   <LANG-PAIR>   <TEST SET>   <SYSTEM>   <SYSTEM LEVEL SCORE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair using two letter abbreviations for the languages (cz for Czech, de for German, en for English, es for Spanish, fr for French, hu for Hungarian). You should use de-en for German-English, for example.
TEST SET is the ID of the test set (given by the setid attribute of of the tstset tag in the XML file, or by the directory structure in the plain text files).
SYSTEM is the ID of system being scored (given by the sysid attribute in the XML document, or as part of the filename for the plain text file).
SYSTEM LEVEL SCORE is the overall system level score.

Each field should be delimited by a single tab character.

Output file format for segment-level rankings

The output files for segment-level rankings should be formatted in the following way:

<METRIC NAME>   <LANG-PAIR>   <TEST SET>   <SYSTEM>   <SEGMENT NUMBER>   <SEGMENT SCORE>

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair using two letter abbreviations for the languages.
TEST SET is the ID of the test set.
SYSTEM is the ID of system being scored.
SEGMENT NUMBER is the line number starting from one of the plain text input files.
SEGMENT SCORE is the score for the particular segment.

Each field should be delimited by a single tab character.

Changes This Year

There are several refinements over last year's shared evaluation task:

Correlation with Manual Sentence Ranking: Last year we reported how well the automatic metrics correlated with several different types of human judgments. This year we will focus on a single type of human judgment to make the results clearer. Specifically, we will focus on sentence-level ranking, as described in the overview paper for last year's workshop.
News Translation Only: The test set that will be used in the translation task will be focused on translations of news stories, and we will discontinue the use of portions of the Europarl corpus as a test set.

Last Year's Data

The system outputs and human judgments from last year's workshop is available for download from http://www.statmt.org/wmt08/results.html. You can use this to tune your metric's free parameters if it has any. If you want to report results in your paper, you can use this data to compare the performance of your metric against the published results from last year.

Last year's data contains all of the system's translations, the source documents and reference human translations and the human judgments of the translation quality.

Other Requirements

If you participate in the evaluation shared task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.

You are invited to submit a short paper (up to 4 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.

Dates

January 9, 2009: Short paper submissions (4 pages)
January 12, 2009: System translations released (sgml format or txt). There are also contrastive systems if you have enough time to run you metric over them (Contrastive: sgml format or txt)
January 23, 2009: Automatic evaluation scores due (by email to ccb@cs

jhu

edu)

supported by the EuroMatrix project, P6-IST-5-034291-STP
funded by the European Commission under Framework Programme 6

EACL 2009 FOURTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION