Biomedical translation Task - EMNLP 2018 Third Conference on Machine Translation

Shared Task: Biomedical Translation Task

Task description

This task aims to evaluate systems on the translation of documents from the biomedical domain. The test data will consist of Medline abstracts and other biomedical documents. This year, the biomedical translation task will address the following language pairs:

English-Chinese and Chinese-English NEW
English-French and French-English
English-German and German-English NEW
English-Portuguese and Portuguese-English
English-Romanian
English-Spanish and Spanish-English

Data

Participants can rely on training (and development) data from various sources, for instance:

The Biomedical Translation repository includes scientific publications (en/pt, en/es, en/fr) and clinical trials (en/pt).
UFAL Medical Corpus (formerly HimLCorpus) includes medical text from various sources for many language pairs (en/es, en/de, en/fr, en/ro). HimL test sets can be used as the development sets for some language pairs (en/es, en/de, en/fr, en/ro).
The Khresmoi development data can be used for some language pairs (en/es, en/de, en/fr).
The UNCorpus contains training data for some languages (en/es, en/fr, en/zh).
The MeSpEn corpus contains many parallel en/es documents. Please note that this resource includes Medline data; we kindly ask participants not to train their systems on the Medline collection.

Participants are also free to use out-of-domain data. We kindly ask participants not to train systems on Medline documents, given that our test sets rely on this data.

Evaluation

Evaluation will be carried out both automatically and manually. Automatic evaluation will make use of standard machine translation metrics, such as BLEU. Native speakers of each of the languages will manually check the quality of the translation for a small sample of the submissions. We also expect participants to support us in the manual evaluation (in proportion to the number of submissions). We plan to release test sets for the following language pairs and sources:

Chinese/English (both directions): scientific publications from Medline
French/English (both directions): scientific publications from Medline
Portuguese/English (both directions): scientific publications from Medline
Spanish/English (both directions): scientific publications from Medline
German/English (both directions): scientific publications from Medline
English to Romanian: scientific publications from Medline
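
As a rough illustration of the automatic evaluation mentioned above, the following is a minimal sketch of corpus-level BLEU with one reference per sentence. It is not the organisers' scoring script: the whitespace tokenization, the absence of smoothing and the function names are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per sentence and no smoothing.

    Tokenization is plain whitespace splitting (an assumption of this sketch).
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with no match gives a score of zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["the patient was treated with aspirin"]
hypothesis = ["the patient was treated with ibuprofen"]
print(corpus_bleu(hypothesis, reference))
```

Scores computed this way are only indicative; they may differ from official results, which depend on the tokenization and BLEU implementation actually used.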

Submission format

The format of the test files will be the BioC XML format of the Scielo corpus.

The training data and the test data are available in the BioC format. More information about BioC, as well as readers and writers for many programming languages, can be found on the BioC web site.

An example of the test set format is shown below for the English to Spanish (en2es) language pair:

<document>
  <id>S123456789</id>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">title</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>title sentence</text>
    </sentence>
  </passage>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">abstract</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>sentence 0</text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text>sentence 1</text>
    </sentence>
    ...
  </passage>
</document>

An example of the submission format is shown below for the above en2es language pair:

<document>
  <id>S123456789</id>
  <passage>
    <infon key="language">ES</infon>
    <infon key="section">title</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>translation of title sentence</text>
    </sentence>
  </passage>
  <passage>
    <infon key="language">ES</infon>
    <infon key="section">abstract</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>translation of sentence 0</text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text>translation of sentence 1</text>
    </sentence>
    ...
  </passage>
</document>

Please identify each sentence with the corresponding "sentnum" specified in the test file. The submission file has the same format as the test file, except for the "language" infon, which should contain the target language instead of the source language, and the "text" element, which should contain the translation of the text into the target language.
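
The conversion from a test document to a submission document can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration: make_submission and the translate callable are hypothetical names, the lambda stands in for a real translation system, and the sample input is a shortened version of the en2es example.

```python
import copy
import xml.etree.ElementTree as ET


def make_submission(test_root, translate, target_language):
    """Return a copy of a BioC tree with translated text and updated language.

    `translate` is a hypothetical callable that maps a source sentence to its
    translation; it stands in for a real MT system.
    """
    root = copy.deepcopy(test_root)
    for passage in root.iter("passage"):
        # Only direct children are inspected, so sentence-level infons
        # such as "sentnum" are left untouched.
        for infon in passage.findall("infon"):
            if infon.get("key") == "language":
                infon.text = target_language
        for sentence in passage.findall("sentence"):
            text = sentence.find("text")
            if text is not None and text.text is not None:
                text.text = translate(text.text)
    return root


TEST_XML = """<document>
  <id>S123456789</id>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">title</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>title sentence</text>
    </sentence>
  </passage>
</document>"""

test_root = ET.fromstring(TEST_XML)
submission = make_submission(test_root, lambda s: "translation of " + s, "ES")
print(ET.tostring(submission, encoding="unicode"))
```

For real files, the tree would presumably be loaded with ET.parse and saved with the write method of the resulting ElementTree; the sketch keeps everything in memory so that it runs without any input file.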

Submission Requirements

Please register your team using this form as soon as possible. You will receive an email confirming your registration, which will also contain the link for submission.

The test files are available in the WMT'18 biomedical task Google Drive folder.

The name of each submission file should include the original test file name preceded by the team identifier (as registered in the form above) and the run number, as in the following examples:

The submission file for run 1 of the "ABC" team for the Medline dataset for English to Spanish should be called "ABC_run1_medline_en2es_es.xml".
The submission file for run 3 of the "ABC" team for the Medline dataset for Spanish to English should be called "ABC_run3_medline_es2en_en.xml".

Each team will be allowed to submit up to 3 runs per test dataset.
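
The naming convention can be sketched as a small helper. The pattern below is inferred from the two examples above and is an assumption, not an official specification; submission_file_name and its parameters are hypothetical names.

```python
def submission_file_name(team, run, dataset, source, target):
    """Build a submission file name following the pattern of the examples.

    The pattern is inferred from the two "ABC" examples and may not cover
    every test file name.
    """
    if not 1 <= run <= 3:
        raise ValueError("each team may submit at most 3 runs per test dataset")
    return f"{team}_run{run}_{dataset}_{source}2{target}_{target}.xml"


print(submission_file_name("ABC", 1, "medline", "en", "es"))
print(submission_file_name("ABC", 3, "medline", "es", "en"))
```

When in doubt, the actual test file name from the shared folder should take precedence over this inferred pattern.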

Results

Results for the biomedical task are available here.

The gold standard files (and corresponding alignment files) are available in the WMT'18 biomedical task Google Drive folder.
NEW We now include plain text versions of the test and gold standard files which include the automatic alignment. Please check the .tar.gz files. Each line in a file (e.g., line 2 in doc11_pt.txt) should be compared to the corresponding line in the corresponding file in the other language (e.g., line 2 in doc11_en.txt).
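
The line-by-line correspondence between the plain text files can be sketched as follows. The sample strings are invented placeholders rather than real test data, paired_lines is a hypothetical helper, and the check that both files have the same number of lines is an assumption based on the description above.

```python
def paired_lines(source_text, target_text):
    """Pair line i of one file with line i of the other, numbered from 1."""
    source_lines = source_text.splitlines()
    target_lines = target_text.splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError("aligned files are expected to have the same number of lines")
    return [
        (number, source, target)
        for number, (source, target) in enumerate(zip(source_lines, target_lines), start=1)
    ]


# Placeholder contents standing in for doc11_pt.txt and doc11_en.txt.
pt_text = "Primeira frase.\nSegunda frase.\n"
en_text = "First sentence.\nSecond sentence.\n"

for number, pt_line, en_line in paired_lines(pt_text, en_text):
    print(number, pt_line, "|", en_line)
```

With the real files, the two strings would be read from disk, for instance with pathlib.Path.read_text, assuming UTF-8 encoding.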

Important dates

Release of test data	June 8th, 2018
Results submission deadline	June 22nd, 2018 (extended from June 15th, 2018 and June 20th, 2018)
Paper submission deadline	July 27th, 2018 (TBC)
Paper notification	August 18th, 2018 (TBC)
Camera-ready version due	August 31st, 2018 (TBC)
Conference in Brussels	October 31 - November 1, 2018

Organisers

Cristian Grozea (Fraunhofer Institute, Germany)
Antonio Jimeno Yepes (IBM Research Australia)
Madeleine Kittner (Humboldt-Universität zu Berlin, Germany)
Aurélie Névéol (LIMSI, CNRS, France)
Mariana Neves (German Federal Institute for Risk Assessment, Germany)
Amy Siu (Beuth University of Applied Sciences, Germany)
Karin Verspoor (University of Melbourne, Australia)

Please contact us by email at wmtbiomedical@gmail.com. Please also join our discussion forum.