The medical text translation task of WMT14 focuses on translation of texts from the medical domain. The task is split into two subtasks: translation of sentences from summaries of medical documents and translation of medical search queries.
The goal of the medical text translation task is to investigate the applicability of current MT techniques to the translation of domain-specific and genre-specific texts. We encourage both beginners and established research groups to participate in this novel task.
Texts from specific domains (such as medicine) and genres (such as search queries) are characterised by frequent occurrence of specific vocabulary and syntactic constructions which are rare or even absent in traditional general-domain training data and are therefore difficult to translate for an SMT system. In-domain training data for such specific purposes is usually scarce or not available at all.
Medicine is an example of a domain for which some in-domain training data is available. We provide links to such resources for four European languages: Czech, English, French, and German. These resources can be used to train an SMT system from scratch or to adapt an existing one. The task challenges participants to improve current methods of machine translation and their domain/genre adaptation. Participants will use their systems to translate test sets consisting of unseen sentences in the source language. Translation quality will be measured by various automatic evaluation metrics.
For the first subtask, English test sentences were randomly sampled from automatically generated summaries of documents containing medical information aimed at the general public and medical professionals, found to be relevant to the 50 topics provided for the CLEF 2013 eHealth Task 3. Out-of-domain and ungrammatical sentences were manually removed. The development and test sentences are provided with document IDs and topic IDs; the topic descriptions are provided as well. The sentences were translated by medical experts into Czech, French, and German, and the translations were further reviewed.
For the second subtask, English test queries were randomly sampled from real user query logs provided by the Health on the Net foundation and the Trip database. The queries were translated into Czech, German, and French by medical experts and reviewed.
You may participate in any or all of the following language pairs (both directions):
If you use additional training data (beyond the resources listed on this page below) or existing translation systems (e.g. on-line systems), you must indicate upon submission that your system uses additional resources. We will distinguish system submissions that used the provided in-domain training data and the data provided for the standard translation task (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.
We provide links to several in-domain data resources that are freely available for research purposes. Some of the resources require user registration and licence agreement. To lower the barrier to entry, we provide a set of easy-to-use scripts to extract parallel data in the plain-text sentence-aligned format and monolingual plain texts for language modelling. See the download section below.
The data is provided in plain text format and in an SGML format that suits the NIST scoring tool.
Data set | Parallel sentences | Links | Notes |
---|---|---|---|
EMEA | 1M | CS-EN, DE-EN, FR-EN | Direct download. |
COPPA | 1.6M | FR-EN | Provided on DVD, data sent on request. The extraction script splits the data into in-domain and out-of-domain. |
MuchMore | 29K | DE-EN | Direct download (two files!). |
PatTR | 1.8M-2.2M* | DE-EN, FR-EN | Direct download. The extraction script splits the data into in-domain and out-of-domain. |
UMLS | 116K-675K* | ALL | Provided upon registration (download the 2013AB Full Release). The script extracts a term-to-term translation dictionary. |
Wikipedia titles | 3K-10K* | CS-EN, DE-EN, FR-EN | Direct download, provided by Charles University in Prague. |
Corpus | Sentences | Tokens | Links | Notes |
---|---|---|---|---|
AACT | >3.1M | 58.7M | EN | Direct download. |
DrugBank | 23K | 826K | EN | Direct download. |
GENIA | 18K | 557K | EN | Direct download. |
GREC | 1K | 62K | EN | Direct download. |
FMA | 150K | 884K | EN | Direct download. |
PatTR descriptions | 1-1.5M* | 38M-52M* | DE-EN, FR-EN | Direct download (the same source as for the parallel data above). The script extracts monolingual sentences from the descriptions section. It splits the data into in-domain and out-of-domain. |
PIL | 20K | 567K | EN | Direct download. |
UMLS descriptions | 3K-200K* | 1K-6.3M* | ALL | Provided upon registration (the same source as for the parallel data above). The script extracts monolingual sentences from term descriptions. |
Wikipedia articles | 50K-562K* | 2M-23M* | EN, CS, DE, FR | Direct download, provided by Charles University in Prague. |
A set of scripts that extract plain text sentence-aligned parallel data and plain-text monolingual data for language modelling from the original packages can be downloaded here.
For intrinsic evaluation (translation quality), convert your output files into the SGML format required by the NIST evaluation tool (see the instructions for the standard task here), and upload your translations of the khresmoi-summary and khresmoi-query test sets (any translation direction) to matrix.statmt.org.
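As a rough illustration of the wrapping step (the set ID, system ID, and document ID below are placeholders; the real values must match those in the released source SGML files), one translation per line can be wrapped into NIST-style SGML along these lines:

```python
import html

def wrap_nist_sgml(lines, setid, srclang, trglang, sysid, docid="doc1"):
    """Wrap one translation per line into a minimal NIST-style SGML test-set file.

    setid/sysid/docid are illustrative placeholders; copy the actual
    attribute values from the released source files for the test set.
    """
    out = ['<tstset setid="%s" srclang="%s" trglang="%s">' % (setid, srclang, trglang)]
    out.append('<doc sysid="%s" docid="%s">' % (sysid, docid))
    for i, line in enumerate(lines, 1):
        # Escape &, <, > so the segment text stays valid SGML.
        out.append('<seg id="%d">%s</seg>' % (i, html.escape(line.strip())))
    out.append('</doc>')
    out.append('</tstset>')
    return "\n".join(out)
```

This is only a sketch of the general shape; the authoritative format is whatever the NIST scoring tool and the standard-task instructions specify.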
The first submitted run will be considered primary. Other runs (if any) will be considered contrastive.
For extrinsic evaluation (cross-lingual information retrieval quality), convert your translations of the clir-query test sets (from any language to English) into the following format based on the NIST SGML, with 10-best distinct translations of each query: each "seg" element carries a new "rank" attribute, an integer from 1 to 10 corresponding to the rank of the translation variant.
Example:
<seg id="1" rank="1">test translation variant 1</seg>
<seg id="1" rank="2">test translation variant 2</seg>
...
<seg id="1" rank="10">test translation variant 10</seg>
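A small helper along the following lines could emit such ranked segments from an n-best list (a sketch, assuming the list is already ordered best-first; the function name is our own):

```python
import html

def ranked_segs(seg_id, variants):
    """Emit up to 10 distinct <seg> elements with the extra "rank" attribute.

    `variants` is assumed to be ordered by system preference (rank 1 = best).
    """
    seen, out = set(), []
    for v in variants:
        if v in seen:
            continue  # the task asks for *distinct* translations
        seen.add(v)
        out.append('<seg id="%s" rank="%d">%s</seg>'
                   % (seg_id, len(out) + 1, html.escape(v)))
        if len(out) == 10:
            break
    return out
```

For example, an n-best list containing duplicates would yield fewer than 10 segments, with ranks renumbered consecutively over the distinct variants.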
Submit your results via email to Pavel Pecina (pecina@ufal.mff.cuni.cz). We reserve the right to evaluate a limited number of contrastive submissions from each participant.
Evaluation will be done automatically using common evaluation metrics. We expect submitted translations to be recased, detokenized, and in SGML format.
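Scoring itself is handled by the organisers' tools, but to give a flavour of the kind of metric involved, corpus-level BLEU can be sketched as modified n-gram precision with a brevity penalty (a simplified, unsmoothed implementation for illustration only, not the official NIST scorer):

```python
import math
from collections import Counter

def bleu(hyps, refs, max_n=4):
    """Simplified corpus BLEU over tokenized sentences (one reference each)."""
    p_num = [0] * max_n  # clipped n-gram matches
    p_den = [0] * max_n  # total hypothesis n-grams
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
            r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            p_num[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            p_den[n - 1] += sum(h.values())
    if 0 in p_num or 0 in p_den:
        return 0.0  # unsmoothed: any empty precision zeroes the score
    log_prec = sum(math.log(p_num[n] / p_den[n]) for n in range(max_n)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)
```

Real evaluations use the official tools (and often multiple metrics); this sketch only shows the basic mechanics.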
Event | Date |
---|---|
Task announcement | December 12, 2013 |
Release of development test sets | January 6, 2014 |
Release of test sets | March 10, 2014 |
Submission of translations | March 14, 2014 |
Submission of papers | April 1, 2014 |
You are invited to submit a report about your approach. Your report should highlight the ways in which your methods and data differ from standard approaches.
We thank all the data providers for granting licenses, especially the Health on the Net Foundation for the English general-public queries, the Trip database for the English medical-expert queries, and the other providers of the summary sentences. We thank the expert translators for translating the data.
For questions, comments, etc. please send an email to Pavel Pecina (pecina@ufal.mff.cuni.cz).
Supported by the European Commission under the Khresmoi project (grant number 257528).