This shared task focuses on automatic methods for translation suggestion (TS), which automatically provide alternatives for the incorrect spans of MT sentences. Translation suggestion is an important tool for computer-aided translation and has proven its ability to improve the efficiency of post-editing (PE). There are two main pitfalls in conventional work in this area:
Our specific goals are:
For all tasks, the datasets and the NMT models that generated the translations are publicly available.
Participants are also allowed to use publicly available pre-trained models and to explore any corpus (monolingual or bilingual) provided by the WMT22 general translation task, but these resources should be disclosed in their system descriptions.
| Event | Date |
|---|---|
| Release of training and dev data | April 25th, 2022 |
| Release of test data | June 29th, 2022 |
| Submission deadline | July 8th, 2022 |
| System descriptions deadline | September 1st, 2022 |
| Paper notification | October 6th, 2022 |
| Camera-ready deadline | October 15th, 2022 |
Note: The system description papers should follow the paper submission policy of WMT; please see the paper submission information section on the WMT homepage for more details. All deadlines are 11:59 PM UTC+8.
This task offers human-labeled gold data for four translation directions: Chinese-English (Zh-En), English-Chinese (En-Zh), English-German (En-De) and German-English (De-En). The datasets were collected by translating sampled source sentences with a SOTA Transformer NMT model; the outputs were then annotated by professional translators. A detailed description of the data collection can be found in the WeTs setup. Each sample includes the source sentence, the MT sentence, the incorrect span of the MT sentence, and the top-1 suggestion.
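For illustration, a single sample might look as follows (a minimal sketch; the field names and the offset-based span encoding are hypothetical, so consult the released files for the actual format):

```python
# Hypothetical Zh-En sample: the incorrect span of the MT sentence is
# represented here as character offsets into the MT string.
sample = {
    "source": "机器翻译的质量越来越高。",
    "mt": "The quality of machine translation is more and more higher.",
    "incorrect_span": (38, 58),   # covers "more and more higher"
    "suggestion": "getting higher and higher",  # top-1 human suggestion
}
```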
Training and dev data: Download the training and development data.
Test data: Participants are expected to submit their results on the test set. You can download the test data here. The table below gives the corpus statistics:
| | Train | Dev | Test |
|---|---|---|---|
| En-De | 12000 | 2000 | 1000 |
| De-En | 10000 | 2000 | 1000 |
| En-Zh | 15000 | 2700 | 1000 |
| Zh-En | 15000 | 2700 | 1000 |
Baselines: The baseline system is a conventional Transformer model implemented with the fairseq toolkit. For the baseline, the input to the Transformer encoder is the concatenation of the source and MT sentences, where the incorrect span of the MT sentence is replaced with a special placeholder token.
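As an illustration, the sketch below shows one way such an encoder input could be assembled before tokenization (the `<mask>` placeholder and `</s>` separator are assumptions, not necessarily the tokens used by the official baseline):

```python
def build_baseline_input(source: str, mt: str, span_start: int, span_end: int,
                         placeholder: str = "<mask>", sep: str = "</s>") -> str:
    """Concatenate the source and MT sentences, replacing the annotated
    incorrect span of the MT sentence with a placeholder token."""
    masked_mt = mt[:span_start] + placeholder + mt[span_end:]
    return f"{source} {sep} {masked_mt}"

# Hypothetical En-De example where "Haus" is the annotated incorrect span.
print(build_baseline_input("The building is tall.", "Das Haus ist hoch.", 4, 8))
# -> The building is tall. </s> Das <mask> ist hoch.
```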
Evaluation: Each submission will be evaluated in terms of the document-level BLEU score of the top-1 suggestion against the reference sentences. We use the official evaluation tool sacrebleu. For Chinese, the BLEU score is calculated on characters with the default tokenizer for Chinese; for English and German, the BLEU score is calculated on case-sensitive words with the default tokenizer 13a.
sacrebleu ref.txt -i hyp.detok.txt -l en-de
sacrebleu ref.txt -i hyp.detok.txt -l de-en
sacrebleu ref.txt -i hyp.detok.txt -l zh-en
sacrebleu ref.txt -i hyp.detok.txt -l en-zh
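The same scores can also be computed with sacrebleu's Python API; the sketch below assumes plain-text files with one detokenized segment per line:

```python
import sacrebleu

with open("hyp.detok.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# 13a is the default tokenizer for English/German targets;
# pass tokenize="zh" when the target language is Chinese.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a")
print(bleu.score)
```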
Compared with Task 1, the difference is that we also provide the model with hints, which help the model produce more accurate suggestions. For this task, each sample includes the source sentence, the MT sentence, the incorrect span of the MT sentence, hints for the top-1 suggestion, and the top-1 suggestion itself. The hints are generated automatically following the WeTs setup. Note: The hints used here differ somewhat from those used in WeTs: we take only the first k characters of the suggestion as the hint, where k is randomly sampled.
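A minimal sketch of this prefix-hint scheme is shown below (the sampling range for k is an assumption, as the exact distribution is not specified here):

```python
import random

def make_hint(suggestion: str) -> str:
    """Return the first k characters of the gold suggestion as the hint,
    with k sampled uniformly at random from 1..len(suggestion)."""
    k = random.randint(1, max(1, len(suggestion)))
    return suggestion[:k]

# Hypothetical example: a sampled prefix of the gold suggestion.
print(make_hint("getting higher and higher"))  # e.g. "getting hi"
```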
Training and dev data: Download the training and development data.
Test data: Participants are expected to submit their results on the test set. You can download the test data here.
Baselines: The baseline system is a conventional Transformer model implemented with the fairseq toolkit. For the baseline, the input to the Transformer encoder is the concatenation of the source sentence, the MT sentence and the hint, where the incorrect span of the MT sentence is replaced with a special placeholder token.
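Extending the earlier input-construction sketch, the hint is appended as a third segment (the separator and placeholder tokens remain assumptions):

```python
def build_hinted_input(source: str, mt: str, span_start: int, span_end: int,
                       hint: str, placeholder: str = "<mask>",
                       sep: str = "</s>") -> str:
    """Concatenate the source sentence, the masked MT sentence and the hint."""
    masked_mt = mt[:span_start] + placeholder + mt[span_end:]
    return f"{source} {sep} {masked_mt} {sep} {hint}"
```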
Evaluation: Each submission will be evaluated in terms of the document-level BLEU score of the top-1 suggestion against the reference sentences. We use the official evaluation tool sacrebleu. For this task, we only provide corpora for the English-Chinese (En-Zh) and Chinese-English (Zh-En) translation directions. For Chinese, the BLEU score is calculated on characters with the default tokenizer for Chinese; for English, the BLEU score is calculated on case-sensitive words with the default tokenizer 13a.
sacrebleu ref.txt -i hyp.detok.txt -l zh-en
sacrebleu ref.txt -i hyp.detok.txt -l en-zh
Attention: All training, dev and test sets are restricted to the corpora provided on this website. If it helps, you can download the NMT models that were used to generate the MT sentences of our corpus.
Each participating team can submit at most 15 systems for each translation direction of each subtask. Participants can submit their results and view their scores on the website. Before submitting, participants are required to sign up by sending an email to the organizers that includes the following information: user name, password, team name, organization, and email.
Feel free to contact us with any questions by dropping an email to Zhen Yang.