The featured translation task of WMT11 is to translate Haitian Creole SMS messages into English. These text messages were sent by people in Haiti in the aftermath of the January 2010 earthquake to an emergency response and information service called "Mission 4636". They were originally written in Haitian Creole and were translated into English by a group of volunteers during the disaster response so that first responders (many of whom did not speak Haitian Creole) could understand and act on them. Simultaneously, volunteers were making maps of Haiti and helping to pinpoint the locations described in the messages. More than 30,000 messages were sent to the 4636 number. First responders used the volunteer-created translations and maps, and were able to act on the vast majority of requests for help.
Secretary of State Clinton described one success of the Mission 4636 program: "The technology community has set up interactive maps to help us identify needs and target resources. And on Monday, a seven-year-old girl and two women were pulled from the rubble of a collapsed supermarket by an American search-and-rescue team after they sent a text message calling for help." Ushahidi@Tufts described another: "The World Food Program delivered food to an informal camp of 2500 people, having yet to receive food or water, in Diquini to a location that 4636 had identified for them."
In this featured task, we will provide the Haitian Creole SMS messages along with the translations that the volunteers created. We have split the messages into training / dev / devtest / test sets, and have assembled additional out-of-domain parallel corpora.
The goals of the Haitian Creole to English translation task are:
We provide data for translating Haitian Creole SMS messages. You may use any of the resources from the standard translation task. The goal is to improve the quality of translating noisy data in a low-resource language. You might consider:
We provide the following data:
Training set | parallel sentences | words per lang | Comments / source |
In-domain SMS data | 17,192 | 35k | This data consists primarily of raw (noisy) SMS data. Courtesy of Mission 4636. |
Medical domain | 1,619 | 10k | Courtesy of CMU. |
Newswire domain | 13,517 | 30k | Courtesy of CMU. |
Glossary | 35,728 | 85k | Courtesy of CMU. |
Wikipedia parallel sentence | 8,476 | 90k | Data automatically extracted from Wikipedia. Courtesy of MSR. |
Wikipedia named entities | 10,499 | 25k | Courtesy of MSR. |
The Bible | 30,715 | 850k | Courtesy of MSR. |
Haitisurf dictionary | 3,763 | 4k | Courtesy of Haitisurf.com (with assistance from MSR). |
Krengle dictionary | 1,687 | 3k | Courtesy of Krengle.net (with assistance from MSR). |
Krengle sentences | 658 | 3k | Courtesy of Krengle.net (with assistance from MSR). |
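To get a sense of how much of the training data is actually in-domain, the sizes in the table can be added up. The following is a minimal sketch, not part of the official task materials: the figures are copied from the table, the word counts are the rounded "k" values (so the word totals are only approximate), and the names `TRAINING_CORPORA` and `summarize` are made up for illustration.

```python
# Sizes copied from the training-data table:
# (corpus name, parallel sentences, approximate words per language in thousands)
TRAINING_CORPORA = [
    ("In-domain SMS data", 17192, 35),
    ("Medical domain", 1619, 10),
    ("Newswire domain", 13517, 30),
    ("Glossary", 35728, 85),
    ("Wikipedia parallel sentence", 8476, 90),
    ("Wikipedia named entities", 10499, 25),
    ("The Bible", 30715, 850),
    ("Haitisurf dictionary", 3763, 4),
    ("Krengle dictionary", 1687, 3),
    ("Krengle sentences", 658, 3),
]


def summarize(corpora, in_domain_name="In-domain SMS data"):
    """Return totals and the in-domain share of sentences and words."""
    total_sentences = sum(sentences for _, sentences, _ in corpora)
    total_kwords = sum(kwords for _, _, kwords in corpora)
    # Look up the row that holds the in-domain SMS corpus.
    in_sentences, in_kwords = next(
        (sentences, kwords)
        for name, sentences, kwords in corpora
        if name == in_domain_name
    )
    return {
        "total_sentences": total_sentences,
        "total_kwords": total_kwords,
        "in_domain_sentence_share": in_sentences / total_sentences,
        "in_domain_word_share": in_kwords / total_kwords,
    }


if __name__ == "__main__":
    summary = summarize(TRAINING_CORPORA)
    print(f"Total parallel sentences: {summary['total_sentences']:,}")
    print(f"Total words per language (approx.): {summary['total_kwords']}k")
    print(f"In-domain share of sentences: {summary['in_domain_sentence_share']:.1%}")
    print(f"In-domain share of words: {summary['in_domain_word_share']:.1%}")
```

By these figures the in-domain SMS data makes up roughly 14% of the sentence pairs but only about 3% of the words, since the Bible dominates the word count.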
Please note: We have anonymized the SMS messages, but in some cases the anonymization may be incorrect or incomplete. Since this is the first release of this data, we are going to control the release a little more closely and ask researchers participating in WMT11 to help identify messages that need to be anonymized. To receive the data, sign up for a GitHub account and send your username to Chris Callison-Burch (ccb@cs.jhu.edu).
If you find additional Haitian Creole training data, we ask that you add it to the git repository.
In addition to this data, you may use any of the data provided in the standard translation task. You are also welcome to use any linguistic tools such as taggers, parsers, or morphological analyzers.
Development set | parallel sentences | words per lang | Comments |
SMS dev clean | 925 | 12k | This set of SMS data was manually cleaned. |
SMS dev raw | 925 | 12k | This set of SMS data was not manually cleaned. It is parallel to the clean set: the messages are the same, but in their original, noisy form. |
SMS devtest clean | 925 | 19k | This set of SMS data was manually cleaned. |
SMS devtest raw | 925 | 19k | This set of SMS data was not manually cleaned. It is parallel to devtest clean, but contains the uncleaned SMS messages. |
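Because the raw and clean sets are parallel, a quick sanity check before tuning on them is to confirm that both files have the same number of lines and to see how many messages the manual cleaning actually changed. The sketch below assumes one message per line in UTF-8 text files; that layout, the file names in the usage comment, and the helper names are all assumptions for illustration, not a description of the released files.

```python
def read_lines(path):
    """Read a text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def pair_clean_and_raw(clean_lines, raw_lines):
    """Pair each cleaned message with its raw counterpart by line position."""
    if len(clean_lines) != len(raw_lines):
        raise ValueError(
            f"Sets are not parallel: {len(clean_lines)} clean lines "
            f"vs {len(raw_lines)} raw lines"
        )
    return list(zip(clean_lines, raw_lines))


def changed_fraction(pairs):
    """Fraction of messages whose cleaned text differs from the raw text."""
    if not pairs:
        return 0.0
    changed = sum(1 for clean, raw in pairs if clean != raw)
    return changed / len(pairs)


# Usage (hypothetical file names):
#   pairs = pair_clean_and_raw(read_lines("dev.clean.ht"), read_lines("dev.raw.ht"))
#   print(f"{changed_fraction(pairs):.1%} of messages were altered by cleaning")
```

Pairing by line position only works if neither file has been re-sorted or filtered, which is why the length check raises an error instead of silently truncating.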
To download the data, sign up for a GitHub account and send your username to Chris Callison-Burch (ccb@cs.jhu.edu).
Evaluation will be done both automatically and by human judgment.
Release of training data | February 4, 2011 |
Test set distributed for translation task | March 14, 2011 |
Submission deadline for translation task | March 18, 2011 |
Paper due date | May 19, 2011 |
You are invited to submit a report about your approach. Your submission report should highlight in which ways your own methods and data differ from the standard approaches.
As with the other tasks, participants agree to contribute about eight hours of work to the manual evaluation.
We thank Rob Munro and Mission 4636 for providing this unique data for scientific study. We thank the Microsoft Translator team at Microsoft Research (especially Will Lewis) for sponsoring the Haitian Creole-English translation task. They generously provided cleaned and re-translated SMS content, negotiated on our behalf for additional data that could be used for the workshop, and helped with defining the scope of the task. Thanks to CMU for providing further training data.
Supported by the EuroMatrixPlus project (P7-IST-231720-STP), funded by the European Commission under Framework Programme 7.