WMT22 Shared Task: Unsupervised MT and Very Low Resource Supervised MT

TASK DESCRIPTION

No machine translation is available for most of the roughly 7,000 languages spoken on Earth, because very limited or no parallel corpora exist for them. Research on unsupervised and very low resource machine translation is important for alleviating this problem. Unsupervised machine translation requires only monolingual data, while very low resource supervised machine translation uses very limited parallel data.

Like last year, the task includes building supervised and unsupervised translation systems for Upper and Lower Sorbian.

Lower and Upper Sorbian are Slavic minority languages spoken in the eastern part of Germany, with 7k and 30k native speakers, respectively. The data for this task was provided by the Sorbian Institute (monolingual data) and the Witaj Sprachzentrum (Witaj Language Center) (both parallel and monolingual data).

LOW RESOURCE SUPERVISED TRANSLATION

All combinations between Upper/Lower Sorbian and German:

Upper Sorbian to/from Lower Sorbian
German to/from Lower Sorbian
German to/from Upper Sorbian

All Upper/Lower Sorbian data (both monolingual and parallel) we release may be used.

In addition, we allow the use of all German, Czech and Polish data released for WMT. Parallel data with German on one side (German-Czech and German-Polish) from the WMT news tasks or available in the OPUS project may be used. No other language data may be used.

UNSUPERVISED TRANSLATION

All combinations between Upper/Lower Sorbian and German:

Upper Sorbian to/from Lower Sorbian
German to/from Lower Sorbian
German to/from Upper Sorbian

All Upper/Lower Sorbian data we release may be used, with the exception of the parallel Upper Sorbian ↔ Lower Sorbian corpus.

Furthermore, the German side of the parallel German ↔ Upper Sorbian and German ↔ Lower Sorbian training corpora may not be used.
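
One reading of the two restrictions above can be written down as a small filter. This is only a sketch, not an official rule checker: the `Corpus` class, the use of `de`/`hsb`/`dsb` as language labels, and the function name are illustrative assumptions, and the function only covers the training corpora released for this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str         # hypothetical label, not an official file name
    languages: tuple  # ("hsb",) for monolingual data, ("de", "hsb") for parallel data


def usable_sides_unsupervised(corpus):
    """Return the language sides of a released training corpus that the
    unsupervised track appears to allow (assumed reading of the rules)."""
    langs = set(corpus.languages)
    # The parallel Upper Sorbian <-> Lower Sorbian corpus is excluded entirely.
    if langs == {"hsb", "dsb"}:
        return []
    # For German <-> Sorbian training corpora, the German side is dropped,
    # so the Sorbian side remains usable as monolingual text.
    # Monolingual Sorbian corpora pass through unchanged.
    return sorted(langs - {"de"})
```

Under this reading, a German ↔ Upper Sorbian training corpus contributes only its Upper Sorbian side, which can then be treated like any other monolingual data.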

DATA

Parallel Lower Sorbian ↔ German Data

Parallel Lower Sorbian data (2022): 40194_train_dsb_de.de.gz, 40194_train_dsb_de.dsb.gz

Development and development test data (2021): devtest.dsb-de.tgz
Development and development test data (2022): valid.de.gz, valid.dsb.gz

Please do not use the blind test data from last year.
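
The paired files above (for example 40194_train_dsb_de.de.gz and 40194_train_dsb_de.dsb.gz) look like line-aligned parallel text, with one sentence per line. Assuming that layout and UTF-8 encoding, neither of which is stated on this page, they could be read together roughly as follows. The function name `read_parallel` is hypothetical.

```python
import gzip
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) pairs from two gzip files that are
    assumed to be line-aligned, one sentence per line."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(tgt_path, "rt", encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which
            # signals that the two sides have different lengths.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

Raising on a length mismatch, rather than silently truncating, makes a broken download or a wrong file pairing visible early.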

Monolingual Lower Sorbian Data

Monolingual Lower Sorbian data (2021): mono.dsb.gz
Monolingual Lower Sorbian data (2022): 66408_DSB_monolingual.txt.gz, 8815_DSB_wikipedia_2021.txt.gz

Parallel Upper Sorbian ↔ German data

Parallel data (2020): train.hsb-de.hsb.gz, train.hsb-de.de.gz
Parallel data (2021): train2021.hsb-de.hsb.gz, train2021.hsb-de.de.gz
Parallel data (2022): HSB-DE_train.tsv.gz

Dev and test data (2020): devtest.tar.gz
Dev and test data (2022): HSB-DE_dev.tsv.gz

Please do not use the blind test data from the previous years.
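
The 2022 Upper Sorbian ↔ German files are distributed as gzipped TSV. The column order is not documented on this page, so the sketch below simply assumes two tab-separated fields per line and leaves it to the caller to check which column holds which language. The function name `read_tsv_pairs` is hypothetical.

```python
import gzip


def read_tsv_pairs(path):
    """Read a gzipped TSV file assumed to hold two tab-separated text
    fields per line. Returns (pairs, skipped), where skipped counts
    lines that do not have exactly two non-empty fields."""
    pairs, skipped = [], 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2 and all(f.strip() for f in fields):
                pairs.append((fields[0], fields[1]))
            else:
                skipped += 1
    return pairs, skipped
```

Splitting on tabs by hand avoids the quote handling of a CSV parser, which can misread sentences that contain quotation marks.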

Monolingual Upper Sorbian Data

Monolingual data (2020): sorbian_institute_monolingual.hsb.gz - Upper Sorbian monolingual data provided by the Sorbian Institute (a high-quality corpus and some medium-quality data, mixed together).
Monolingual data (2020): witaj_monolingual.hsb.gz - Upper Sorbian monolingual data provided by the Witaj Sprachzentrum (high quality).
Monolingual data (2020): web_monolingual.hsb.gz - Upper Sorbian monolingual data scraped from the web by CIS, LMU (thanks to Alina Fastowski). Use with caution: this data is probably noisy and might erroneously contain some text from related languages.
Monolingual data (2022): HSB_monolingual.txt.gz

Parallel Upper Sorbian ↔ Lower Sorbian Data

Parallel data (2022): train_dsb_hsb_62564.dsb.gz, train_dsb_hsb_62564.hsb.gz

Dev and test data (2022): devtest_dsb_hsb_2022.tar.gz

Additional German, Czech and Polish data

See the news translation task web page (also previous years) for additional data.

BLIND TEST SET SUBMISSION

We updated the DE-HSB test set (August 24th). Please check that you use the correct version.

Use OCELoT to download the test sets and to submit the translations.

After registering your team, please send a mail to Marion Di Marco (lastname -AT- cis.uni-muenchen.de) to verify your registration.

Please also refer to the news translation task for more details on the submission process.

When submitting your translations, please select the respective entry in the "Test set" field: entries starting with "unsup22" are for unsupervised systems and entries starting with "lowres22" are for supervised systems.

(Many thanks to Tom Kocmi, Christian Federmann and Roman Grundkiewicz for making OCELoT work.)

OVERVIEW AUTOMATIC METRICS

This document contains a first overview of the BLEU and chrF2 scores of the primary submissions: primary-submissions-overview.pdf

We plan to update this document with more information.

EVALUATION

At present, we plan to evaluate this task with automatic metrics. We believe manual evaluation is less necessary for unsupervised and very low resource MT development, because automatic metrics have worked well at this (relatively low) level of translation quality in the past. We may reconsider this.
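
To give a feel for what a character n-gram metric such as chrF2 measures, here is a simplified sentence-level sketch. It is not the scoring setup used for this task, which is not specified on this page, and its numbers will not match a standard tool that aggregates statistics over a whole test set. The function names and the choices of a maximum n-gram order of 6 and whitespace removal are assumptions modelled on common chrF defaults.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level character n-gram F-score on a 0-100 scale.
    beta=2 weights recall more heavily than precision (the '2' in chrF2)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the score works on characters rather than words, partially correct word forms still earn credit, which is one reason such metrics are often reported for morphologically rich languages. For reportable scores, an established evaluation tool should be used instead of this sketch.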

IMPORTANT DATES

Release of training/dev/test data	22nd June 2022
Release of blind test data	22nd August 2022
Translation submission deadline	29th August 2022
Paper submission deadline	7th September 2022

Please be aware that the translation submission deadline is very close to the paper submission deadline on 7th September!

ORGANIZERS

Alexander Fraser - CIS, LMU Munich

Marion Di Marco - CIS, LMU Munich

Marko Měškank - Witaj Sprachzentrum

Olaf Langner - Witaj Sprachzentrum

Hauke Bartels - Sorbian Institute

Marcin Szczepański - Sorbian Institute

Questions or comments can be posted for discussion at wmt-tasks@googlegroups.com.

Organizational issues can be directed to Marion Di Marco and Alexander Fraser.

ACKNOWLEDGMENTS

This work was supported by DFG (grant FR 2829/4-1).