Shared Task: Biomedical Translation Task
Task description
This task aims to evaluate systems on the translation of documents from the biomedical domain.
The test data will consist of biomedical abstracts and summaries of proposals for animal experiments.
This year, the biomedical translation task will address the following language pairs:
- English-Basque (en/eu)
- English-Chinese and Chinese-English (en/zh, zh/en)
- English-French and French-English (en/fr, fr/en)
- English-German and German-English (en/de, de/en)
- English-Italian and Italian-English (en/it, it/en)
- English-Portuguese and Portuguese-English (en/pt, pt/en)
- English-Russian and Russian-English (en/ru, ru/en)
- English-Spanish and Spanish-English (en/es, es/en)
Data
ATTENTION!!
We ask participants not to download the Medline database themselves in order to retrieve training data.
Submissions derived from a model that was trained on the whole of PubMed will not be considered in the evaluation.
Participants can rely on training (and development) data from various sources, for instance:
- The Biomedical Translation repository includes links to parallel corpora of scientific publications (en/pt, en/es, en/fr, en/de, en/zh, en/it, en/ru), among others
- The UFAL Medical Corpus (formerly HimLCorpus) includes medical text from various sources for many language pairs (en/es, en/de, en/fr, en/ro). HimL test sets can be used as the development sets for some language pairs (en/es, en/de, en/fr, en/ro)
- The Khresmoi development data can be used for some language pairs (en/es, en/de, en/fr).
- The UNCorpus contains training data for some languages (en/es, en/fr, en/zh)
- The MeSpEn corpus contains many parallel documents of en/es
- The Scielo full text corpus for en, es and pt
- The Brazilian Thesis and Dissertations for en/pt
- The ICD-10 codes translation train/dev datasets for en/eu
- The medical domain monolingual corpora include hospital notes, medical domain Wikipedia articles and medical dictionaries (eu)
- Out-of-domain monolingual corpora for eu
- Out-of-domain parallel corpora (en/eu)
- Osagaiz Biomedical abstract translations (en/eu)
Participants are also free to use out-of-domain data.
Other resources:
- BERTeus, pre-trained model for Basque
Evaluation
Evaluation will be carried out both automatically and manually.
Automatic evaluation will make use of standard machine translation metrics, such as BLEU.
Native speakers of each of the languages will manually check the quality of the translation for a small sample of the submissions.
If necessary, we also expect participants to support us in the manual evaluation (depending on the number of submissions).
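To give a feel for what an automatic metric such as BLEU measures, here is a minimal sketch of corpus-level BLEU. It is an illustration only: it assumes a single reference per sentence, whitespace tokenization and no smoothing, none of which is specified by the task. The function names are hypothetical, and the scores it produces should not be expected to match those of the official evaluation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100) with one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the patient was treated with antibiotics for ten days"]
    hyps = ["the patient was treated with antibiotics for two weeks"]
    print(round(corpus_bleu(hyps, refs), 2))
```

For real experiments, an established scoring tool is preferable to a hand-written metric, since tokenization and smoothing choices change the numbers considerably.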
We plan to release test sets for the following language pairs and sources:
- Scientific abstracts:
- English to Basque
- Chinese/English (both directions)
- French/English (both directions)
- German/English (both directions)
- Italian/English (both directions)
- Portuguese/English (both directions)
- Russian/English (both directions)
- Spanish/English (both directions)
- Translation of terms from biomedical terminologies:
- Summaries of proposals for animal experiments: NEW
Test Sets and Submission formats
Terms from biomedical terminologies:
For the translation of terms, the format of the test set will be one term per line, as in the example below:
Traumatic rupture of left ulnar collateral ligament, subsequent encounter
posterior dislocation of left hip, sequela
Anterior subluxation of right humerus
other subluxation of left foot, subsequent encounter
Hypothermia of newborn
Absence and agenesis of lacrimal apparatus
poisoning by unspecified primarily systemic and hematological agent, assault, initial encounter
Disorders of glycoprotein metabolism
The submission format will be the same, as in the example below. Participants should keep the terms in the same order as in the original test set file.
ezkerreko kubitu aldeko lotailu albokideko etendura traumatikoa, segidako harremana
ezkerreko aldakako atzeko lokadura, sekuela
eskuineko besahezurraren aurreko subluxazio
ezkerreko oineko beste subluxazio batzuk, segidako harremana
jaioberriaren hipotermia
malko-aparatuaren absentzia eta agenesia
nagusiki sistemikoa eta hematologikoa den agente zehaztugabeak eragindako pozoidura, erasoa, hasierako harremana
glukoproteinen metabolismoaren nahasmenduak
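Because the terminology files carry no identifiers, the translations are matched to the source terms by line position only. The following sketch shows one way to check a submission before uploading it. The function names are hypothetical and the checks (same line count, no empty lines) are assumptions about what a sensible pre-submission test looks like, not rules stated by the organisers.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without line endings."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


def check_terms_submission(source_terms, translated_terms):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    if len(source_terms) != len(translated_terms):
        problems.append(
            f"expected {len(source_terms)} lines, found {len(translated_terms)}"
        )
    for line_no, term in enumerate(translated_terms, start=1):
        if not term.strip():
            problems.append(f"line {line_no} is empty")
    return problems


if __name__ == "__main__":
    source = ["Hypothermia of newborn", "Disorders of glycoprotein metabolism"]
    translated = ["jaioberriaren hipotermia", "glukoproteinen metabolismoaren nahasmenduak"]
    print(check_terms_submission(source, translated))
```

With real files, the two lists would come from `read_lines` applied to the test set and to the submission file.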
Scientific abstracts:
For the test set of Medline abstracts, the format will be plain text files.
The format will be the following:
DOC_ID SENT_ID SENTENCE_TEXT
The three values are separated by a TAB character:
- DOC_ID: a sequential identifier (e.g. doc1, doc8), not the original PMID in Medline
- SENT_ID: a sequential number from 1 to n
- SENTENCE_TEXT: the sentence text to be translated by the participants
doc1 1 sentence_1
doc2 2 sentence_2
doc2 3 sentence_3
doc2 4 sentence_4
doc2 5 sentence_5
...
doc2 n sentence_n
doc4 1 sentence_1
doc4 2 sentence_2
...
The submission format will be the same, as in the example below.
Participants should keep the sentences in the same order as in the original test set file.
doc1 1 translated_sentence_1
doc2 2 translated_sentence_2
doc2 3 translated_sentence_3
doc2 4 translated_sentence_4
doc2 5 translated_sentence_5
...
doc2 n translated_sentence_n
doc4 1 translated_sentence_1
doc4 2 translated_sentence_2
...
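The abstract format can be handled by splitting each line on the TAB character, translating only the third field, and writing the identifiers back unchanged. The sketch below assumes exactly three TAB-separated fields per line, as described above. The `translate` argument is a hypothetical placeholder for whatever system a participant uses, and the function name is likewise illustrative.

```python
def translate_abstract_lines(lines, translate):
    """Translate the SENTENCE_TEXT field of each line, keeping DOC_ID and SENT_ID."""
    output = []
    for line in lines:
        # maxsplit=2 keeps any further TAB characters inside the sentence text.
        doc_id, sent_id, text = line.rstrip("\n").split("\t", 2)
        translation = translate(text)
        # A TAB or newline in the translation would break the three-column format.
        translation = " ".join(translation.split())
        output.append("\t".join((doc_id, sent_id, translation)))
    return output


if __name__ == "__main__":
    test_lines = ["doc1\t1\tfirst sentence", "doc1\t2\tsecond sentence"]
    # str.upper stands in for a real translation system.
    for row in translate_abstract_lines(test_lines, str.upper):
        print(row)
```

Processing the lines in input order and emitting exactly one output line per input line keeps the submission aligned with the test set.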
Summaries of proposals for animal experiments:
For the test set of summaries, the format will be plain text files.
The format will be the following:
DOC_ID SECTION_ID SENT_ID SENTENCE_TEXT
The four values are separated by a TAB character:
- DOC_ID: a sequential identifier, e.g. doc1, doc8
- SECTION_ID: a code made of three characters
- SENT_ID: a sequential number from 1 to n
- SENTENCE_TEXT: the sentence text to be translated by the participants
doc1 Ttl 1 sentence
doc1 Zwc 1 sentence
doc1 Ntz 1 sentence
doc1 Ntz 2 sentence
doc1 Sch 1 sentence
...
doc1 Sec n sentence
doc2 Ttl 1 sentence
doc2 Zwc 1 sentence
...
The submission format will be the same, as in the example below.
Participants should keep the sentences in the same order as in the original test set file.
doc1 Ttl 1 translated_sentence
doc1 Zwc 1 translated_sentence
doc1 Ntz 1 translated_sentence
doc1 Ntz 2 translated_sentence
doc1 Sch 1 translated_sentence
...
doc1 Sec n translated_sentence
doc2 Ttl 1 translated_sentence
doc2 Zwc 1 translated_sentence
...
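Since the summaries use three identifier columns before the sentence text, it can help to verify that a submission repeats those columns exactly and in the same order as the test set. The sketch below does this comparison line by line. The function name and the shape of the returned report are hypothetical, and the check itself is a suggestion rather than part of the official submission procedure.

```python
def find_key_mismatches(test_lines, submission_lines, key_columns=3):
    """Compare identifier columns of two TAB-separated files, line by line.

    Returns a list of (line_number, message) tuples; line number 0 is used
    for problems that concern the file as a whole.
    """
    mismatches = []
    if len(test_lines) != len(submission_lines):
        mismatches.append(
            (0, f"line count differs: {len(test_lines)} vs {len(submission_lines)}")
        )
    pairs = zip(test_lines, submission_lines)
    for line_no, (source, submitted) in enumerate(pairs, start=1):
        source_key = source.split("\t")[:key_columns]
        fields = submitted.split("\t")
        if len(fields) != key_columns + 1:
            mismatches.append((line_no, "wrong number of TAB-separated fields"))
        elif fields[:key_columns] != source_key:
            mismatches.append((line_no, "identifier columns differ from the test set"))
    return mismatches


if __name__ == "__main__":
    test_set = ["doc1\tTtl\t1\tsentence", "doc1\tNtz\t1\tsentence"]
    submission = ["doc1\tTtl\t1\ttranslated", "doc1\tNtz\t1\ttranslated"]
    print(find_key_mismatches(test_set, submission))
```

Calling the function with `key_columns=2` applies the same check to the abstract format, which has only DOC_ID and SENT_ID before the text.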
Submission Requirements
Please note that, following general WMT policy explicitly enforced in other tasks, we will release all participants'
submissions after this year's edition of the task to promote further studies.
Please register your team using this form.
You will receive a mail with the confirmation of your registration.
The link to the submission site will be included in this mail. Please register your team as soon as possible.
The test files will be available in the WMT'21 biomedical task Google Drive folder.
The name of each submission file should consist of the original test file name preceded by the team identifier
(as registered in the form above) and the run number, following this example for the abstracts:
- The submission file for run 1 of the "ABC" team for the Medline abstracts from English to Spanish should be called
"ABC_run1_abstract_en2es_es.txt".
- The submission file for run 3 of the "ABC" team for the Medline abstracts from Spanish to English should be called
"ABC_run3_abstract_es2en_en.txt".
- The submission file for run 1 of the "ABC" team for the abstracts from English to Basque should be called
"ABC_run1_abstract_en2eu_eu.txt".
A similar format should be followed for the terminology sub-task.
However, there is no need to identify the language, since this task only addresses English to Basque:
- The submission file for run 1 of the "ABC" team for the terminology test set should be called "ABC_run1_terms.txt".
A similar format should be followed for the summaries of proposals for animal experiments sub-task.
However, there is no need to identify the language, since this task only addresses German to English:
- The submission file for run 1 of the "ABC" team for the summaries of proposals for animal experiments should be called "ABC_run1_summary.txt".
Each team will be allowed to submit up to 3 runs per test set.
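The naming rules above can be captured in a small helper, which reduces the risk of typos when several runs and language pairs are submitted. This is a sketch inferred from the example file names only: the function name, the sub-task labels used as arguments and the error handling are all hypothetical.

```python
MAX_RUNS = 3  # each team may submit up to 3 runs per test set


def submission_filename(team, run, subtask, source_lang=None, target_lang=None):
    """Build a submission file name following the examples given for each sub-task."""
    if not 1 <= run <= MAX_RUNS:
        raise ValueError(f"run must be between 1 and {MAX_RUNS}, got {run}")
    if subtask == "abstract":
        if not source_lang or not target_lang:
            raise ValueError("abstract submissions need source and target languages")
        return f"{team}_run{run}_abstract_{source_lang}2{target_lang}_{target_lang}.txt"
    if subtask in ("terms", "summary"):
        # Single language direction, so no language codes appear in the name.
        return f"{team}_run{run}_{subtask}.txt"
    raise ValueError(f"unknown sub-task: {subtask}")


if __name__ == "__main__":
    print(submission_filename("ABC", 1, "abstract", "en", "es"))
    print(submission_filename("ABC", 3, "abstract", "es", "en"))
    print(submission_filename("ABC", 1, "terms"))
    print(submission_filename("ABC", 1, "summary"))
```

The printed names correspond to the examples listed above for the "ABC" team.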
Results
Initial results for the biomedical task are available here.
Important dates
Release of test data | June 28th 2021 |
Results submission deadline | July 7th 2021 (previously July 5th 2021) |
Paper submission deadline | August 5, 2021 |
Paper notification | September 5, 2021 |
Camera-ready version due | September 5, 2021 |
Conference EMNLP | November 10-11, 2021 |
Organisers
Rachel Bawden (University of Edinburgh, UK)
Giorgio Maria Di Nunzio (University of Padua, Italy)
Cristian Grozea (Fraunhofer Institute, Germany)
Iñigo Jauregi (University of Technology Sydney, Australia)
Antonio Jimeno Yepes (University of Melbourne, Australia)
David Martinez (University of Melbourne, Australia)
Aurélie Névéol (Université Paris Saclay, CNRS, LISN, France)
Mariana Neves (German Federal Institute for Risk Assessment, Germany)
Maite Oronoz (University of the Basque Country)
Olatz Perez de Viñaspre (University of the Basque Country)
Roland Roller (DFKI, Germany)
Amy Siu (Beuth University of Applied Sciences, Germany)
Philippe Thomas (DFKI, Germany)
Federica Vezzani (University of Padua, Italy)
Maika Vicente Navarro (Maika Spanish Translator, Melbourne, Australia)
Dina Wiemann (Novartis, Switzerland)
Lana Yeganova (NCBI/NLM/NIH, USA)
Please contact us by email at wmtbiomedical@gmail.com.
Please join our discussion forum.