Shared Task: Code-mixed Machine Translation (MixMT)
Overview
The mixing of words and phrases from two different languages in a single utterance of text or speech is a frequently observed phenomenon in multilingual communities such as India and Spain. This pattern of communication is broadly categorized as code-mixing or code-switching. This shared task comprises two subtasks involving one code-mixed language, Hinglish (a mix of Hindi and English). Both subtasks are briefly described below:
- Monolingual to code-mixed machine translation (Subtask-1): In this subtask, Hindi and English are the two source languages and the target language is Hinglish. The source Hindi and English sentences are translations of each other. The Hindi language sentences are written in the Devanagari script whereas the target Hinglish language text is written in the Roman script.
- Code-mixed to monolingual machine translation (Subtask-2): In this subtask, Hinglish is the source language and the target language is English. Both the English and Hinglish text are written in the Roman script.
The shared task is hosted at Codalab. Please observe the following guidelines:
- If you are participating in the competition, please register your team by filling out this form.
- By participating in the competition, you agree to submit a detailed report of your system that follows the WMT 2022 submission guidelines.
Important dates
The training + validation phase starts | Apr 01, 2022
The training + validation phase ends | Jun 30, 2022
The test phase starts | Jul 1, 2022
The test phase ends | Jul 30, 2022
Paper submission deadline | Sept 7, 2022
Paper notification | Oct 9, 2022
Camera-ready deadline | Oct 16, 2022
All deadlines are in AoE (Anywhere on Earth).
Note:
The system description papers should follow the WMT paper submission policy. Please see the paper submission information section on the WMT homepage for more details.
Training datasets
We provide the following training datasets for both the subtasks:
Subtask-1: Monolingual to code-mixed machine translation
For this subtask, HinGE is the primary training dataset. This dataset is part of an ongoing shared task (HinglishEval) at INLG 2022. We provide the training and validation data available from HinglishEval for training the machine translation system for this subtask. We strongly recommend that participating teams read the dataset description [here] of the HinglishEval task for a better understanding of the dataset format. The download links for the datasets are:
Synthetic dataset:
Download the training and validation data.
Human-generated dataset:
Download the training and validation data.
Subtask-2: Code-mixed to monolingual machine translation
For this subtask, PHINC is the primary training dataset. It contains 13,738 parallel sentences in Hinglish and English. [Download]
Evaluation Metrics
We use two evaluation metrics for both the subtasks: ROUGE-L (F1-score) and Word Error Rate (WER).
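For local sanity checks, the two metrics can be approximated as in the sketch below. It is not the official scorer: whitespace tokenization, case-sensitive matching, and sentence-level scoring are all assumptions, and the function names are illustrative only. ROUGE-L is computed from the longest common subsequence (LCS) of the two token sequences, and WER is the word-level edit distance divided by the reference length.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """Sentence-level ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

A higher ROUGE-L F1 is better, while a lower WER is better. WER can exceed 1.0 when the hypothesis is much longer than the reference.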
Baselines
We use Google Translate as a baseline for both the subtasks. In subtask-1, we translate the Hindi sentences (in Devanagari script) into English and evaluate the output against the reference Hinglish sentences. In subtask-2, we translate the Hinglish sentences into English with the source language of the Hinglish text set to Hindi.
Additional Resources
The participating teams are allowed and encouraged to use external datasets for both subtasks. Some of the references to get the external datasets are
Submission Requirements
Please note that we only accept submissions that attempt both subtasks. Submissions with a solution for only one subtask are not allowed. Follow these steps to create the submission file in either phase:
- Save the prediction results for subtask 1 in a text file.
- Append the prediction results for subtask 2 to the same text file.
- Rename the text file as submission_val.txt (validation phase) or submission_test.text (test phase).
- Zip the file and rename it as submission_val.zip (validation phase) or submission_test.zip (test phase).
- Submit the zip file.
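The packaging steps above can be scripted, as in the following minimal sketch. It assumes one prediction per line and that the text file sits at the top level of the archive. Neither detail is stated in the steps, so check both against the sample submission files. The function name `build_submission` and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def build_submission(subtask1_preds, subtask2_preds,
                     txt_name="submission_val.txt", out_dir="."):
    """Write subtask 1 predictions followed by subtask 2 predictions to one
    text file, then zip that file under the same base name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / txt_name

    # Subtask 1 comes first; subtask 2 is appended after it.
    # Assumption: each prediction occupies exactly one line.
    lines = list(subtask1_preds) + list(subtask2_preds)
    txt_path.write_text(
        "".join(line.strip() + "\n" for line in lines), encoding="utf-8"
    )

    # submission_val.txt -> submission_val.zip
    zip_path = txt_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps the file at the archive root (an assumption).
        zf.write(txt_path, arcname=txt_path.name)
    return zip_path
```

For the test phase, pass the test-phase file name given in the steps as `txt_name`. The zip name is derived from it automatically.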
Sample Submission Files
Organizers
- Vivek Srivastava (TCS Research, India)
- Mayank Singh (IIT Gandhinagar, India)
Contact
Feel free to email the organizers with any questions.