Shared Task: Biomedical Translation Task
Task description
This task aims to evaluate systems on the translation of documents from the biomedical domain.
The test data will consist of Medline abstracts and other biomedical documents.
This year, the biomedical translation task will address the following language pairs:
- English-Chinese and Chinese-English
- English-French and French-English
- English-German and German-English
- English-Portuguese and Portuguese-English
- English-Spanish and Spanish-English
Data
Participants can rely on training (and development) data from various sources, for instance:
- The Biomedical Translation repository includes
scientific publications (en/pt, en/es, en/fr) and clinical trials (en/pt).
- The Medline abstracts training data for es/en, de/en, pt/en and fr/en.
- The UFAL Medical Corpus (formerly HimLCorpus) includes
medical text from various sources for many language pairs (en/es, en/de, en/fr, en/ro).
HimL test sets can be used as the development sets for some language pairs (en/es, en/de, en/fr, en/ro).
- The Khresmoi development data can be used for some language pairs (en/es, en/de, en/fr).
- The UNCorpus contains training data for some languages (en/es, en/fr, en/zh).
- The MeSpEn corpus contains many parallel documents of en/es.
- The Scielo full text corpus for en, es and pt.
- The Brazilian Thesis and Dissertations for en/pt.
Participants are also free to use out-of-domain data.
Evaluation
Evaluation will be carried out both automatically and manually.
Automatic evaluation will make use of standard machine translation metrics, such as BLEU.
Native speakers of each of the languages will manually check the quality of the translation for a small sample of the submissions.
We also expect participants to support us in the manual evaluation (in proportion to the number of their submissions).
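To illustrate the kind of automatic metric mentioned above, here is a minimal sketch of corpus-level BLEU. It assumes whitespace tokenisation, a single reference per sentence and no smoothing. The official evaluation may use a different implementation and tokenisation, so scores from this sketch are only indicative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in the range 0..1, one reference per hypothesis.

    Whitespace tokenisation and the absence of smoothing are simplifying
    assumptions of this sketch, not properties of the official scoring.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(system_sentences, reference_sentences)` with two equally long lists of strings returns a single score for the whole test set.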
We plan to release test sets for the following language pairs and sources:
- Scientific abstracts from Medline:
- Chinese/English (both directions)
- French/English (both directions)
- Portuguese/English (both directions)
- Spanish/English (both directions)
- German/English (both directions)
- Translation of terms from biomedical terminologies:
- English to Spanish
Test Sets and Submission formats
Scientific abstracts from Medline:
For the test set of Medline abstracts, the format will be plain text files.
The format will be the following:
DOC_ID SENT_ID SENTENCE_TEXT
The three values are separated by a TAB character:
- DOC_ID: a sequential document identifier (e.g. doc1, doc8), not the original PMID in Medline
- SENT_ID: a sequential number from 1 to n
- SENTENCE_TEXT: the sentence text to be translated by the participants
doc1 1 sentence_1
doc2 2 sentence_2
doc2 3 sentence_3
doc2 4 sentence_4
doc2 5 sentence_5
...
doc2 n sentence_n
doc4 1 sentence_1
doc4 2 sentence_2
...
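The format above can be read with a few lines of code. The following is a minimal sketch, assuming UTF-8 encoded files; the function names `parse_medline_lines` and `read_medline_file` are hypothetical and not part of any official tooling.

```python
def parse_medline_lines(lines):
    """Parse 'DOC_ID<TAB>SENT_ID<TAB>SENTENCE_TEXT' lines into tuples."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        # Split on the first two TABs only, so the sentence text stays whole.
        fields = line.split("\t", 2)
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 TAB-separated fields"
            )
        doc_id, sent_id, text = fields
        records.append((doc_id, int(sent_id), text))
    return records


def read_medline_file(path):
    """Read a test file from disk (UTF-8 is an assumption of this sketch)."""
    with open(path, encoding="utf-8") as handle:
        return parse_medline_lines(handle)
```

Each record is a `(doc_id, sent_id, text)` tuple, so sentences can be grouped by document before translation if a system works at document level.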
The format for the submission will be the same, as in the example below.
The participants should follow the same order of the sentences as in the original test set file.
doc1 1 translated_sentence_1
doc2 2 translated_sentence_2
doc2 3 translated_sentence_3
doc2 4 translated_sentence_4
doc2 5 translated_sentence_5
...
doc2 n translated_sentence_n
doc4 1 translated_sentence_1
doc4 2 translated_sentence_2
...
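A submission in this format can be produced by reusing the identifiers of the test set. The sketch below assumes the test set is held as a list of `(doc_id, sent_id, text)` tuples and that translations are supplied in the same order as the test sentences; `format_submission` is a hypothetical helper name.

```python
def format_submission(test_records, translations):
    """Build submission lines that reuse the test set's DOC_ID and SENT_ID.

    test_records: list of (doc_id, sent_id, source_text) tuples
    translations: list of translated sentences, in the same order
    """
    if len(test_records) != len(translations):
        raise ValueError("one translation is needed per test sentence")
    lines = []
    for (doc_id, sent_id, _source), translation in zip(test_records, translations):
        # Tabs or newlines inside a translation would break the
        # three-column layout, so collapse all whitespace to single spaces.
        clean = " ".join(translation.split())
        lines.append(f"{doc_id}\t{sent_id}\t{clean}")
    return lines
```

The returned lines can be joined with newline characters and written to the submission file.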
Terms from biomedical terminologies:
For the translation of terms, the format of the test set will be one term per line, as in the example below:
Coronal craniosynostosis
growth hormone treatment
Stenosis of right renal artery
17q21 Microdeletion Syndrome
Paraplegia/paraparesis
Mitochondrial pathology
acute aortic dissection
Glaucoma of left eye
Otitis media of left ear
Renal cortical atrophy
The format for the submission will be the same, as in the example below.
Participants should keep the terms in the same order as in the original test set file.
Craneosinostosis coronal
tratamiento con hormona del crecimiento
Estenosis de la arteria renal derecha
Síndrome de microdeleción 17q21
Paraplejia/paraparesia
Enfermedad mitocondrial
disección aórtica aguda
Glaucoma del ojo izquierdo
Otitis media del oído izquierdo
Atrofia cortical renal
The evaluation will measure case-insensitive accuracy on a subset of the test data, which will be manually translated and used as the gold standard.
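As a rough illustration of this measure, the sketch below computes case-insensitive accuracy over aligned lists of terms. Treating a prediction as correct only when it matches the gold term exactly (after trimming surrounding whitespace and ignoring case) is an assumption; the exact matching rules used by the organisers are not specified here.

```python
def case_insensitive_accuracy(predictions, gold):
    """Fraction of predicted terms equal to the gold term, ignoring case."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        # casefold() handles case-insensitive comparison of accented text.
        p.strip().casefold() == g.strip().casefold()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)
```

For example, a prediction of "craneosinostosis coronal" counts as correct against the gold term "Craneosinostosis coronal", while "Atrofia renal" does not match "Atrofia cortical renal".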
Submission Requirements
Please register your team using this form.
You will receive an e-mail confirming your registration.
The link to the submission site will be provided in this e-mail.
Please register your team as soon as possible.
The test files will be available in the
WMT'19 biomedical task Google Drive folder.
The name of each submission file should consist of the original test file name preceded by the team identifier
(as registered in the form above) and the run number, following this example:
- The submission file for run 1 of the "ABC" team for the Medline dataset for English to Spanish should be called
"ABC_run1_medline_en2es_es.txt".
- The submission file for run 3 of the "ABC" team for the Medline dataset for Spanish to English should be called
"ABC_run3_medline_es2en_en.txt".
A similar format should be followed for the terminology sub-task.
However, there is no need to identify the languages, since this sub-task only addresses English to Spanish:
- The submission file for run 1 of the "ABC" team for the terminology dataset should be called "ABC_run1_terms.txt".
Each team will be allowed to submit up to 3 runs per test set.
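The naming convention can be sketched as two small helpers. The name pattern is inferred from the examples above, under the assumption that the original Medline test file is named like "medline_en2es_es.txt"; the helper names and the `MAX_RUNS` constant are hypothetical.

```python
MAX_RUNS = 3  # up to 3 runs per test set, as stated above


def _check_run(run):
    """Reject run numbers outside the allowed range."""
    if not 1 <= run <= MAX_RUNS:
        raise ValueError(f"run must be between 1 and {MAX_RUNS}")


def medline_submission_name(team, run, source, target):
    """File name for a Medline run, e.g. ABC_run1_medline_en2es_es.txt."""
    _check_run(run)
    return f"{team}_run{run}_medline_{source}2{target}_{target}.txt"


def terms_submission_name(team, run):
    """File name for a terminology run, e.g. ABC_run1_terms.txt."""
    _check_run(run)
    return f"{team}_run{run}_terms.txt"
```

Here `team` is the identifier registered in the form, and `source` and `target` are the two-letter language codes of the translation direction.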
Results
Results for the biomedical task are available here.
The keys for the Medline test sets are available in the "test_sets/Medline/gold_standard/" folder of the
WMT'19 biomedical task Google Drive folder.
The following files are included:
- Ten "medline_*" files with the reference sentences for each Medline test set.
- Five "mapdocs_*" files, one for each language pair, for mapping the sequential document identifiers to PubMed PMIDs.
- Five "align_validation_*" files, one for each language pair, with the manual validation of the automatic alignment.
The order of the languages in the alignment (column 2 in the TAB-separated file) is foreign language vs. (<=>) English.
Important dates
Release of test data | April 26, 2019 |
Results submission deadline | May 3, 2019 |
Paper submission deadline | May 17, 2019 |
Paper notification | June 7, 2019 |
Camera-ready version due | June 17, 2019 |
Conference in Florence | August 1 - 2, 2019 |
Organisers
Cristian Grozea (Fraunhofer Institute, Germany)
Antonio Jimeno Yepes (IBM Research Australia)
Madeleine Kittner (Humboldt-Universität zu Berlin, Germany)
Martin Krallinger (Centro Nacional de Investigaciones Oncológicas - CNIO, Barcelona Supercomputing Center - BSC, Spain)
Aurélie Névéol (LIMSI, CNRS, France)
Mariana Neves (German Federal Institute for Risk Assessment, Germany)
Felipe Soares (Barcelona Supercomputing Center - BSC, Spain)
Amy Siu (Beuth University of Applied Sciences, Germany)
Karin Verspoor (University of Melbourne, Australia)
Please contact us by e-mail at wmtbiomedical@gmail.com.
Please also join our discussion forum.