The medical text translation task of WMT14 focuses on translation of texts from the medical domain. The task is split into two subtasks: translation of sentences from summaries of medical documents and translation of medical search queries.
The goal of the medical text translation task is to investigate the applicability of current MT techniques to the translation of domain-specific and genre-specific texts. We encourage both beginners and established research groups to participate in this novel task.
Texts from specific domains (such as medicine) and genres (such as search queries) are characterised by frequent occurrence of specific vocabulary and syntactic constructions which are rare or even absent in traditional general-domain training data and are therefore difficult to translate for an SMT system. In-domain training data for such specific purposes is usually scarce or not available at all.
Medicine is an example of a domain for which some in-domain training data is available. We provide links to such resources for four European languages: Czech, English, French, and German. These resources can be used to train an SMT system from scratch or to adapt an existing one. The task challenges participants to improve current methods of machine translation and their domain/genre adaptation. Participants will use their systems to translate test sets consisting of unseen sentences in the source language. Translation quality will be measured by various automatic evaluation metrics.
For the first subtask, English test sentences were randomly sampled from automatically generated summaries of documents containing medical information aimed at the general public and medical professionals, found to be relevant to the 50 topics provided for the CLEF 2013 eHealth Task 3. Out-of-domain and ungrammatical sentences were manually removed. The development and test sentences are provided with document IDs and topic IDs; the topic descriptions are provided as well. The sentences were translated by medical experts into Czech, French, and German, and the translations were further reviewed.
For the second subtask, English test queries were randomly sampled from real user query logs provided by the Health on the Net foundation and the Trip database. The queries were translated into Czech, German, and French by medical experts and reviewed.
You may participate in any or all of the following language pairs (both directions):
If you use additional training data (beyond the resources listed on this page below) or existing translation systems (e.g. on-line systems), you must indicate upon submission that your system uses additional resources. We will distinguish system submissions that used the provided in-domain training data and the data provided for the standard translation task (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.
We provide links to several in-domain data resources that are freely available for research purposes. Some of the resources require user registration and licence agreement. To lower the barrier to entry, we provide a set of easy-to-use scripts to extract parallel data in the plain-text sentence-aligned format and monolingual plain texts for language modelling. See the download section below.
The data is provided in plain text format and in an SGML format that suits the NIST scoring tool.
Data set | Parallel sentences | Links | Notes |
---|---|---|---|
EMEA | 1M | CS-EN, DE-EN, FR-EN | Direct download. |
COPPA | 1.6M | FR-EN | Provided on DVD, data sent on request. The extraction script splits the data into in-domain and out-of-domain. |
MuchMore | 29K | DE-EN | Direct download (two files!). |
PatTR | 1.8M-2.2M* | DE-EN, FR-EN | Direct download. The extraction script splits the data into in-domain and out-of-domain. |
UMLS | 116K-675K* | ALL | Provided upon registration (download the 2013AB Full Release). The script extracts a term-to-term translation dictionary. |
Wikipedia titles | 3K-10K* | CS-EN, DE-EN, FR-EN | Direct download, provided by Charles University in Prague. |
Corpus | Sentences | Tokens | Links | Notes |
---|---|---|---|---|
AACT | >3.1M | 58.7M | EN | Direct download. |
DrugBank | 23K | 826K | EN | Direct download. |
GENIA | 18K | 557K | EN | Direct download. |
GREC | 1K | 62K | EN | Direct download. |
FMA | 150K | 884K | EN | Direct download. |
PatTR descriptions | 1-1.5M* | 38M-52M* | DE-EN, FR-EN | Direct download (the same source as for the parallel data above). The script extracts monolingual sentences from the descriptions section. It splits the data into in-domain and out-of-domain. |
PIL | 20K | 567K | EN | Direct download. |
UMLS descriptions | 3K-200K* | 1K-6.3M* | ALL | Provided upon registration (the same source as for the parallel data above). The script extracts monolingual sentences from term descriptions. |
Wikipedia articles | 50K-562K* | 2M-23M* | EN, CS, DE, FR | Direct download, provided by Charles University in Prague. |
A set of scripts that extract plain text sentence-aligned parallel data and plain-text monolingual data for language modelling from the original packages can be downloaded here.
For intrinsic evaluation (translation quality), convert your output files into the SGML format required by the NIST evaluation tool (see the instructions for the standard task here), and upload your translations of the khresmoi-summary and khresmoi-query test sets (any translation direction) to matrix.statmt.org.
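As a rough illustration of the wrapping step (the set ID, system ID, and document ID below are placeholders; the real values must match those in the released source SGML files), one translation per line can be wrapped into NIST-style SGML along these lines:

```python
import html

def wrap_nist_sgml(lines, setid, srclang, trglang, sysid, docid="doc1"):
    """Wrap one translation per line into a minimal NIST-style SGML test-set file.

    setid/sysid/docid are illustrative placeholders; copy the actual
    attribute values from the released source files for the test set.
    """
    out = ['<tstset setid="%s" srclang="%s" trglang="%s">' % (setid, srclang, trglang)]
    out.append('<doc sysid="%s" docid="%s">' % (sysid, docid))
    for i, line in enumerate(lines, 1):
        # Escape &, <, > so the segment text stays valid SGML.
        out.append('<seg id="%d">%s</seg>' % (i, html.escape(line.strip())))
    out.append('</doc>')
    out.append('</tstset>')
    return "\n".join(out)
```

This is only a sketch of the general shape; the authoritative format is whatever the NIST scoring tool and the standard-task instructions specify.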
The first submitted run will be considered primary. Other runs (if any) will be considered contrastive.
For extrinsic evaluation (cross-lingual information retrieval quality), convert your translations of the clir-query test sets (from any language to English) into the following format based on the NIST SGML, with 10-best distinct translations of each query: each "seg" element carries a new "rank" attribute, an integer from 1 to 10 corresponding to the rank of the translation variant.
Example:
<seg id="1" rank="1">test translation variant 1</seg>
<seg id="1" rank="2">test translation variant 2</seg>
...
<seg id="1" rank="10">test translation variant 10</seg>
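A small helper along the following lines could emit such ranked segments from an n-best list (a sketch, assuming the list is already ordered best-first; the function name is our own):

```python
import html

def ranked_segs(seg_id, variants):
    """Emit up to 10 distinct <seg> elements with the extra "rank" attribute.

    `variants` is assumed to be ordered by system preference (rank 1 = best).
    """
    seen, out = set(), []
    for v in variants:
        if v in seen:
            continue  # the task asks for *distinct* translations
        seen.add(v)
        out.append('<seg id="%s" rank="%d">%s</seg>'
                   % (seg_id, len(out) + 1, html.escape(v)))
        if len(out) == 10:
            break
    return out
```

For example, an n-best list containing duplicates would yield fewer than 10 segments, with ranks renumbered consecutively over the distinct variants.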
Submit your results via email to Pavel Pecina (pecina@ufal.mff.cuni.cz). We reserve the right to evaluate a limited number of contrastive submissions from each participant.
Evaluation will be done automatically using common evaluation metrics. We expect submitted translations to be recased, detokenized, and in SGML format.
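Scoring itself is handled by the organisers' tools, but to give a flavour of the kind of metric involved, corpus-level BLEU can be sketched as modified n-gram precision with a brevity penalty (a simplified, unsmoothed implementation for illustration only, not the official NIST scorer):

```python
import math
from collections import Counter

def bleu(hyps, refs, max_n=4):
    """Simplified corpus BLEU over tokenized sentences (one reference each)."""
    p_num = [0] * max_n  # clipped n-gram matches
    p_den = [0] * max_n  # total hypothesis n-grams
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
            r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            p_num[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            p_den[n - 1] += sum(h.values())
    if 0 in p_num or 0 in p_den:
        return 0.0  # unsmoothed: any empty precision zeroes the score
    log_prec = sum(math.log(p_num[n] / p_den[n]) for n in range(max_n)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)
```

Real evaluations use the official tools (and often multiple metrics); this sketch only shows the basic mechanics.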
Event | Date |
---|---|
Task announcement | December 12, 2013 |
Release of development test sets | January 6, 2014 |
Release of test sets | March 10, 2014 |
Submission of translations | March 14, 2014 |
Submission of papers | April 1, 2014 |
You are invited to submit a report about your approach. Your report should highlight the ways in which your methods and data differ from standard approaches.
We thank all the data providers for granting licenses, especially the Health on the Net Foundation for the English general-public queries, the Trip database for the English medical-expert queries, and the other providers of the summary sentences. We thank the expert translators for translating the data.
For questions, comments, etc. please send an email to Pavel Pecina (pecina@ufal.mff.cuni.cz).
Supported by the European Commission under the Khresmoi project (grant number 257528).