Translation Task - EMNLP Seventh Conference on Machine Translation

Shared Task: Code-mixed Machine Translation (MixMT)

Overview

Mixing words and phrases from two different languages in a single utterance of text or speech is a frequently observed phenomenon in multilingual communities, such as those in India and Spain. This pattern of communication is broadly categorized as code-mixing or code-switching. In this shared task, we are running two subtasks involving one code-mixed language, Hinglish (code-mixing of Hindi and English). Both subtasks are briefly described below:

Monolingual to code-mixed machine translation (Subtask-1): In this subtask, Hindi and English are the two source languages and the target language is Hinglish. The source Hindi and English sentences are translations of each other. The Hindi sentences are written in the Devanagari script, whereas the target Hinglish text is written in the Roman script.
Code-mixed to monolingual machine translation (Subtask-2): In this subtask, Hinglish is the source language and the target language is English. Both the English and Hinglish text are written in the Roman script.

The shared task is hosted on Codalab. Please note the following important guidelines:

If you are participating in the competition, please register your team by filling out this form.
By participating in the competition, you agree to submit a detailed report of your system following the WMT 2022 submission guidelines.

Important dates

The training + validation phase starts	Apr 01, 2022
The training + validation phase ends	Jun 30, 2022
The test phase starts	Jul 1, 2022
The test phase ends	Jul 30, 2022
Paper submission deadline	Sept 7, 2022
Paper notification	Oct 9, 2022
Camera-ready deadline	Oct 16, 2022

All deadlines are in AoE (Anywhere on Earth).

Note: System description papers should follow the WMT paper submission policy. Please see the paper submission information section on the WMT homepage for more details.

Training datasets

We provide the following training datasets for both the subtasks:

Subtask-1: Monolingual to code-mixed machine translation

For this subtask, HinGE is the primary training dataset. This dataset is part of an ongoing shared task (HinglishEval) at INLG 2022. We provide the training and validation data available from HinglishEval for training the machine translation system for this subtask. We strongly recommend that participating teams read the dataset description [here] of the HinglishEval task for a better understanding of the dataset format. The download links for the datasets are:

Synthetic dataset: Download the training and validation data.

Human-generated dataset: Download the training and validation data.

Subtask-2: Code-mixed to monolingual machine translation

For this subtask, PHINC is the primary training dataset. It contains 13,738 parallel sentences in Hinglish and English. [Download]

Evaluation Metrics

We use two evaluation metrics for both the subtasks: ROUGE-L (F1-score) and Word Error Rate (WER).
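For local validation, the two metrics can be approximated with a minimal sketch such as the one below. It assumes whitespace tokenization, case-sensitive matching, and a single reference per sentence; the function names are illustrative, and the organizers' scoring script may tokenize or normalize differently, so scores may not match the leaderboard exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

A higher ROUGE-L score and a lower WER both indicate output closer to the reference. Note that WER can exceed 1.0 when the hypothesis is much longer than the reference.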

Baselines

We use Google Translate as a baseline for both subtasks. In subtask-1, we translate the Hindi sentences (in Devanagari script) into English and evaluate the output against the reference Hinglish sentences. In subtask-2, we translate the Hinglish sentences into English with the source language of the Hinglish text set to Hindi.

Additional Resources

The participating teams are allowed and encouraged to use external datasets for both subtasks. Some of the references to get the external datasets are

Submission Requirements

Please note that we only accept submissions that attempt both subtasks. Submissions with a solution for only one subtask are not allowed. Follow these steps to create the submission file in both phases:

Save the prediction results for subtask 1 in a text file.
Append the prediction results for subtask 2 to the same text file.
Rename the text file as submission_val.txt (validation phase) or submission_test.text (test phase).
Zip the file and rename it as submission_val.zip (validation phase) or submission_test.zip (test phase).
Submit the zip file.
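The packaging steps above could be scripted along the following lines. This is a sketch under assumptions the page does not state: one prediction per line, no separator between the two subtasks' predictions, and the text file placed at the root of the archive. The function name `build_submission` is hypothetical, and the default file names are those given for the validation phase; please check the sample submission for the exact expected layout.

```python
import zipfile
from pathlib import Path


def build_submission(subtask1_preds, subtask2_preds, out_dir,
                     txt_name="submission_val.txt",
                     zip_name="submission_val.zip"):
    """Write subtask 1 predictions followed by subtask 2 predictions,
    one per line (assumed format), then zip the resulting text file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Subtask 1 results first, subtask 2 results appended after them.
    lines = list(subtask1_preds) + list(subtask2_preds)
    txt_path = out_dir / txt_name
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Store the text file at the archive root under its plain name.
    zip_path = out_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(txt_path, arcname=txt_name)
    return zip_path
```

For the test phase, the same function can be called with the test-phase file names passed as `txt_name` and `zip_name`.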

Sample Submission Files

Training and validation phase -- sample submission
Test phase -- TBD

Organizers

Vivek Srivastava (TCS Research, India)
Mayank Singh (IIT Gandhinagar, India)

Contact

Feel free to email the organizers with any questions.