EMNLP 2011 SIXTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Featured Translation Task: Translating Haitian Creole Emergency SMS messages

July 30 - 31, 2011
Edinburgh, UK

The featured translation task of WMT11 is to translate Haitian Creole SMS messages into English. These text messages were sent by people in Haiti in the aftermath of the January 2010 earthquake. The messages were sent to an emergency response and information service called "Mission 4636". They were originally written in Haitian Creole, and were translated into English by a group of volunteers during the disaster response so that first responders (many of whom did not speak Haitian Creole) could understand and act on them. Simultaneously, volunteers were making maps of Haiti and helping to pinpoint the locations described in the messages. More than 30,000 messages were sent to the 4636 number. First responders used the volunteer-created translations and maps, and were able to act on the vast majority of requests for help.

Secretary of State Clinton described one success of the Mission 4636 program: "The technology community has set up interactive maps to help us identify needs and target resources. And on Monday, a seven-year-old girl and two women were pulled from the rubble of a collapsed supermarket by an American search-and-rescue team after they sent a text message calling for help." Ushahidi@Tufts described another: "The World Food Program delivered food to an informal camp of 2500 people, having yet to receive food or water, in Diquini to a location that 4636 had identified for them."

In this featured task, we will provide the Haitian Creole SMS messages along with the translations that the volunteers created. We have split the messages into training / dev / devtest / test sets, and have assembled additional out-of-domain parallel corpora.

GOALS

The goals of the Haitian Creole to English translation task are:

We hope that both beginners and established research groups will participate in this task.

TASK DESCRIPTION

We provide data for translating Haitian Creole SMS messages. You may use any of the resources from the standard translation task. The goal is to improve the quality of translating noisy data in a low-resource language. You might consider:

Participants will use their systems to translate two test sets consisting of 849 unseen Haitian Creole SMS messages. One of the test sets contains the "raw" SMS messages, and the other contains messages that were cleaned up by human post-editors. The translation quality will be measured by a manual evaluation and various automatic evaluation metrics. We hope that the difference in performance on the raw vs. cleaned test sets will highlight the importance of handling noisy input data.
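As an illustration of how the raw vs. cleaned comparison could be run locally, here is a minimal sketch of a corpus-level BLEU score in Python. It is not the official WMT11 scoring script: it assumes whitespace tokenization, one reference per message, and no smoothing. The example sentences are invented for illustration and are not taken from the Mission 4636 data.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: each hypothesis n-gram is credited at most
            # as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Invented toy data: one shared set of references, and hypothetical system
# output for the cleaned input and for the raw input.
references = ["we need food and water in Diquini", "please send help to the camp"]
output_from_clean = ["we need food and water in Diquini", "please send help to the camp"]
output_from_raw = ["we need food and water", "please send help to camp"]

clean_score = corpus_bleu(output_from_clean, references)
raw_score = corpus_bleu(output_from_raw, references)
print(f"clean: {clean_score:.3f}  raw: {raw_score:.3f}  gap: {clean_score - raw_score:.3f}")
```

Because both test sets are translations of the same messages, scoring both outputs against the same references makes the gap between the two scores a rough indicator of how much the input noise costs a given system.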

TRAINING DATA

We provide the following data:
Training set                  Parallel sentences  Words per language  Comments / source
In-domain SMS data            17,192              35k                 This data consists primarily of raw (noisy) SMS data. Courtesy of Mission 4636.
Medical domain                1,619               10k                 Courtesy of CMU.
Newswire domain               13,517              30k                 Courtesy of CMU.
Glossary                      35,728              85k                 Courtesy of CMU.
Wikipedia parallel sentences  8,476               90k                 Data automatically extracted from Wikipedia. Courtesy of MSR.
Wikipedia named entities      10,499              25k                 Courtesy of MSR.
The Bible                     30,715              850k                Courtesy of MSR.
Haitisurf dictionary          3,763               4k                  Courtesy of Haitisurf.com (with assistance from MSR).
Krengle dictionary            1,687               3k                  Courtesy of Krengle.net (with assistance from MSR).
Krengle sentences             658                 3k                  Courtesy of Krengle.net (with assistance from MSR).

Please note: We have anonymized the SMS messages, but in some cases the anonymization may be incorrect or incomplete. Since this is the first release of this data, we are controlling the release a little more closely and asking researchers participating in WMT11 to help identify messages that need to be anonymized. To receive the data, sign up for a GitHub account and send your username to Chris Callison-Burch (ccb@cs.jhu.edu).

If you find additional Haitian Creole training data, we ask that you add it to the git repository.

In addition to this data, you may use any of the data provided in the standard translation task. You are also welcome to use any linguistic tools such as taggers, parsers, or morphological analyzers.

DEVELOPMENT DATA

Development set    Parallel sentences  Words per language  Comments
SMS dev clean      925                 12k                 This set of SMS data was manually cleaned.
SMS dev raw        925                 12k                 This set of SMS data was not manually cleaned. It is parallel to the dev clean set: the messages are the same, but in their original, noisy form.
SMS devtest clean  925                 19k                 This set of SMS data was manually cleaned.
SMS devtest raw    925                 19k                 This set of SMS data was not manually cleaned. It is parallel to the devtest clean set, but contains the uncleaned SMS messages.

DOWNLOAD

To download the data, sign up for a GitHub account and send your username to Chris Callison-Burch (ccb@cs.jhu.edu).

EVALUATION

Evaluation will be done both automatically and by human judgement.

DATES

Release of training data                     February 4, 2011
Test set distributed for translation task    March 14, 2011
Submission deadline for translation task     March 18, 2011
Paper due date                               May 19, 2011

OTHER REQUIREMENTS

You are invited to submit a report about your approach. Your report should highlight the ways in which your methods and data differ from the standard approaches.

As with the other tasks, participants agree to contribute about eight hours of work to the manual evaluation.

ACKNOWLEDGEMENTS

We thank Rob Munro and Mission 4636 for providing this unique data for scientific study. We thank the Microsoft Translator team at Microsoft Research (especially Will Lewis) for sponsoring the Haitian Creole-English translation task. They generously provided cleaned and re-translated SMS content, negotiated on our behalf for additional data that could be used for the workshop, and helped define the scope of the task. Thanks to CMU for providing further training data.

supported by the EuroMatrixPlus project
P7-IST-231720-STP
funded by the European Commission
under Framework Programme 7