This shared task addresses the generation of image descriptions in a target language. It can be approached as a translation task, which takes a source language description and translates it into the target language, a process that can be supported by information from the image (multimodal translation); or as a multisource multimodal translation task, which takes source language descriptions in multiple languages and translates them into the target language, using the visual information as additional context.
We welcome participation from experienced and new participants alike, and we particularly encourage participants to consider the unconstrained data setting for both tasks. Participants agree to contribute to the manual evaluation: approximately eight hours of work per system submission.
Release of training data: February 12, 2018
Release of test data: June 8, 2018
Results submission deadline: June 15, 2018
Start of manual evaluation: June 20, 2018
End of manual evaluation: July 20, 2018
NEW: Download all pre-processed submissions.
NEW: Please submit your pre-processed WMT18 submissions (link above) and any new submissions to CODALAB.
This task consists of translating English sentences that describe an image into German, French, or Czech, given the English sentence itself and the image that it describes (or features extracted from this image, if participants choose to use them). See Specia et al. (2016) and Elliott et al. (2017) for descriptions of the previous editions of this task at WMT16 and WMT17.
The original data for this task was created by extending the Flickr30K Entities dataset as follows: for each image, one of the English descriptions was selected and manually translated into German, French, and Czech. For English-German, translations were produced by professional translators, who were given the source segment only (training set) or the source segment and the image (validation and test sets). For English-French, translations were produced via crowd-sourcing; translators had access to the source segment, the image, and an automatic translation produced by a standard phrase-based system (a Moses baseline built using the WMT'15 constrained translation task data) as a suggestion to make translation easier. Note that this was not a post-editing task: although translators could copy and paste the suggested translation and edit it, we found that in the vast majority of cases they did not do so. For English-Czech, translations were produced via crowd-sourcing; translators had access to the source segment and the image.
Summary of the datasets:
| | Training | Validation | Test 2016 | Test 2017 | Ambiguous COCO | Test 2018 |
|---|---|---|---|---|---|---|
| Images | 29,000 | 1,014 | 1,000 | 1,000 | 461 | 1,071 |
| Sentences | 29,000 | 1,014 | 1,000 | 1,000 | 461 | 1,071 |
As training and development data, we provide 29,000 and 1,014 tuples respectively, each containing an English source sentence, its German, French, and Czech human translations, and the corresponding image. We also provide the 2016 and 2017 test sets, which can be used for validation and internal evaluation. The English-German datasets are the same as in 2016, but note that the human translations in the 2016 validation and test sets have since been post-edited (by humans) using the images, to ensure that the target descriptions are faithful to those images. There were cases in the 2016 data where the source text was ambiguous and the image was needed to resolve the ambiguity. The French translations were added in 2017 and the Czech translations in 2018.
As test data, we provide a new test set of 1,071 tuples, each containing an English description and its corresponding image. Gold labels will be translations into German, French, or Czech.
Evaluation will be performed using human Direct Assessment. Submissions and reference translations will be pre-processed: lowercased, punctuation-normalised, and tokenised. Each language will be evaluated independently. If you participate in the shared task, we ask you to perform a defined amount of evaluation per language pair submitted.
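For reference, the sketch below shows one way to replicate this kind of pre-processing in Python with the sacremoses package. It illustrates the lowercase/normalise/tokenise steps only; it is not the organisers' official script, and the exact settings used for evaluation may differ.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer

# Moses-style pipeline: normalise punctuation, tokenise, then lowercase.
normalizer = MosesPunctNormalizer(lang="de")
tokenizer = MosesTokenizer(lang="de")

def preprocess(line: str) -> str:
    normalized = normalizer.normalize(line)
    tokenized = tokenizer.tokenize(normalized, return_str=True)
    return tokenized.lower()

print(preprocess("Ein Mann fährt ein rotes Fahrrad."))
# -> "ein mann fährt ein rotes fahrrad ."
```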
This new task consists of translating English sentences that describe an image into Czech, given the English sentence itself, the image that it describes (or features extracted from this image, if participants choose to use them), and parallel sentences in French and German. Participants are free to use any subset of the additional source language data in their submissions.
Summary of the datasets:
| | Training | Validation | Test 2016 | Test 2017 |
|---|---|---|---|---|
| Images | 29,000 | 1,014 | 1,000 | 1,000 |
| Sentences | 29,000 | 1,014 | 1,000 | 1,000 |
As training and development data, we provide 29,000 and 1,014 tuples respectively, each containing English, French, and German source sentences, their Czech human translation, and the corresponding image. We also provide the 2016 validation and test sets, which can be used for validation and internal evaluation. The English-German datasets are the same as in 2016, but note that the human translations in the 2016 validation and test sets have since been post-edited (by humans) using the images, to ensure that the target descriptions are faithful to those images. There were cases in the 2016 data where the source text was ambiguous and the image was needed to resolve the ambiguity. The French translations were added in 2017 and the Czech translations in 2018.
As test data, we provide a test set of 1,000 tuples, each containing English, French, and German descriptions and the corresponding image. Gold labels will be translations into Czech. This test set corresponds to the unseen portion of the Czech Test 2017 data.
Evaluation will be performed using human Direct Assessment. Submissions and reference translations will be pre-processed: lowercased, punctuation-normalised, and tokenised. If you participate in the shared task, we ask you to perform a defined amount of evaluation per language pair submitted.
All of the textual data can be downloaded from the Multi30K Github repository. We also provide example data pre-processing scripts; their use is not mandatory.
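As a starting point, the sketch below shows how the parallel text might be loaded once downloaded. The directory layout and file names (e.g. train.en, train.de) are assumptions based on the Multi30K repository and should be checked against the actual release.

```python
from pathlib import Path

# Assumed layout after unpacking the Multi30K text data; adjust to
# match the actual repository release.
DATA_DIR = Path("multi30k/data/task1/raw")

def load_parallel(src_lang: str = "en", tgt_lang: str = "de", split: str = "train"):
    """Return a list of (source, target) sentence pairs for one split."""
    src_lines = (DATA_DIR / f"{split}.{src_lang}").read_text(encoding="utf-8").splitlines()
    tgt_lines = (DATA_DIR / f"{split}.{tgt_lang}").read_text(encoding="utf-8").splitlines()
    assert len(src_lines) == len(tgt_lines), "splits must be sentence-aligned"
    return list(zip(src_lines, tgt_lines))

pairs = load_parallel("en", "de", "train")
print(len(pairs))  # expected: 29,000 training pairs
```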
We also provide ResNet-50 image features, although their use is not mandatory. The image features can be downloaded here. The raw images for the training and development sets and the 2016 test set can be requested here; images for the 2017 and 2018 test sets are included with the test files.
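If you prefer to compute image features yourself, the following sketch extracts 2048-dimensional ResNet-50 features with torchvision. This is one plausible recipe; the layer and pre-processing used for the officially distributed features may differ.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet pre-processing (assumption: the official features
# may have used a different resize/crop policy).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True)
model.eval()
# Drop the final classification layer to keep the pooled 2048-d features.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

def extract_features(image_path: str) -> torch.Tensor:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        features = feature_extractor(image)
    return features.squeeze()  # shape: (2048,)
```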
We released the new test set on June 8th, 2018.
If you use the datasets created for this shared task, please cite the following papers:
@inproceedings{elliott-EtAl:2016:VL16,
  author = {Desmond Elliott and Stella Frank and Khalil Sima'an and Lucia Specia},
  title = {Multi30K: Multilingual English-German Image Descriptions},
  booktitle = {Proceedings of the 5th Workshop on Vision and Language},
  year = {2016},
  pages = {70--74}
}
@inproceedings{ElliottFrankBarraultBougaresSpecia2017,
  author = {Desmond Elliott and Stella Frank and Lo\"{i}c Barrault and Fethi Bougares and Lucia Specia},
  title = {{Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description}},
  booktitle = {Proceedings of the Second Conference on Machine Translation},
  year = {2017},
  month = {September},
  address = {Copenhagen, Denmark}
}
We suggest the following resources, which can be used as additional training data for either or both tasks:
Your system description should be a short report (4 to 6 pages) submitted to WMT describing your method(s). We ask you to provide a summary and/or an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
Each participating team can submit at most two systems for each task variant and language pair. Submissions should be sent via email to Lucia Specia (lspecia@gmail.com).
Please use the following pattern to name your files:
INSTITUTION-NAME_TASK-NAME_METHOD-NAME_TYPE

where:
- INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF
- TASK-NAME is one of the following: 1 (translation) or 1b (multisource multimodal)
- METHOD-NAME is an identifier for your method, in case you have multiple methods for the same task, e.g. 1b_MultimodalTranslation, 1b_Moses
- TYPE is either C or U, where C indicates "constrained", i.e. using only the resources provided by the task organisers, and U indicates "unconstrained"
For instance, a constrained submission from team SHEF for Task 1b using method "Moses" could be named SHEF_1b_Moses_C.
If you are submitting a system for Task 1, please include the language in the TASK-NAME tag, e.g. 1_FLICKR_DE, 1_FLICKR_FR, etc.
For a given task, your system should produce a target language description for each image, formatted in the following way:
<METHOD NAME> <IMAGE ID> <DESCRIPTION> <TASK> <TYPE>

where:
- METHOD NAME is the name of your method.
- IMAGE ID is the identifier of the test image.
- DESCRIPTION is the output generated by your system (either a translation or an independently generated description).
- TASK is one of the following flags: 1 (for the translation task) or 1b (for the multisource multimodal task).
- TYPE is either C or U, where C indicates "constrained", i.e. using only the resources provided by the task organisers, and U indicates "unconstrained".

For questions or comments, please use the wmt-tasks mailing list.
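For illustration only, here is a minimal sketch of writing a submission file in this format. The method name and image IDs are invented, and the tab separator is an assumption; confirm the exact field delimiter against any official example output.

```python
# Hypothetical system outputs keyed by test image ID (IDs invented here).
outputs = {
    "1007129816": "ein Mann mit einem orangefarbenen Hut .",
    "1009434119": "zwei Hunde spielen im Schnee .",
}

METHOD, TASK, TYPE = "Moses", "1", "C"

# One line per image: <METHOD NAME> <IMAGE ID> <DESCRIPTION> <TASK> <TYPE>.
# Tab-separated fields are an assumption, not part of the official spec.
with open("SHEF_1_FLICKR_DE_Moses_C.txt", "w", encoding="utf-8") as f:
    for image_id, description in sorted(outputs.items()):
        f.write("\t".join([METHOD, image_id, description, TASK, TYPE]) + "\n")
```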
Supported by the following European Commission projects.