Shared Task: Unsupervised MT and Very Low Resource Supervised MT

TASK DESCRIPTION

There is no machine translation available for most of the ~7000 languages spoken on the planet Earth. This is because very limited or no parallel corpora are available. Research on unsupervised and very low resource machine translation is important for alleviating this problem. Unsupervised machine translation requires only monolingual data, while very low resource supervised machine translation uses very limited parallel data.

Like last year, the task includes building supervised and unsupervised translation systems for Upper and Lower Sorbian.

Lower and Upper Sorbian are Slavic minority languages spoken in the eastern part of Germany, with about 7k and 30k native speakers respectively. The data for this task was provided by the Sorbian Institute (monolingual data) and the Witaj Sprachzentrum (Witaj Language Center) (both parallel and monolingual data).

LOW RESOURCE SUPERVISED TRANSLATION

All combinations between Upper/Lower Sorbian and German:

All Upper/Lower Sorbian data (both monolingual and parallel) we release may be used.

In addition, we allow the use of all German, Czech and Polish data released for WMT. Parallel data with German on one side (German-Czech and German-Polish) from the WMT news tasks or available in the OPUS project may also be used. No other language data may be used.

UNSUPERVISED TRANSLATION

All combinations between Upper/Lower Sorbian and German:

All Upper/Lower Sorbian data we release may be used, with the exception of the parallel Upper Sorbian Lower Sorbian corpus.

Furthermore, the German side of the parallel German Upper Sorbian and German Lower Sorbian training corpora may not be used.

In addition, we allow the use of all German, Czech and Polish data released for WMT. Parallel data with German on one side (German-Czech and German-Polish) from the WMT news tasks or available in the OPUS project may also be used. No other language data may be used.

DATA

Parallel Lower Sorbian German Data

Development and development test data (2021): devtest.dsb-de.tgz
Development and development test data (2022): valid.de.gz, valid.dsb.gz

Please do not use the blind test data from last year.
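The 2022 development data is listed as two separate gzip files, one per language. A minimal Python sketch for reading such a pair is given here. It assumes (this is not stated above) that both files are UTF-8 plain text with one sentence per line and that line i of one file corresponds to line i of the other. The function name read_parallel is hypothetical.

```python
import gzip
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two gzip files.

    Assumption: both files are UTF-8, one sentence per line, line-aligned.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(tgt_path, "rt", encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is detected instead of being silently truncated.
        for i, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {i}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

Under these assumptions, list(read_parallel("valid.de.gz", "valid.dsb.gz")) would return the German-Lower Sorbian development pairs. The length check is there because a silent misalignment would corrupt every pair after the first missing line.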

Monolingual Lower Sorbian Data


Parallel Upper Sorbian German Data

Dev and test data (2020): devtest.tar.gz
Dev and test data (2022): HSB-DE_dev.tsv.gz

Please do not use the blind test data from the previous years.

Monolingual Upper Sorbian Data


Parallel Upper Sorbian Lower Sorbian Data

Dev and test data (2022): devtest_dsb_hsb_2022.tar.gz


Additional German, Czech and Polish Data

See the news translation task web page (also previous years) for additional data.

BLIND TEST SET SUBMISSION

We updated the DE-HSB test set (August 24th). Please check that you use the correct version.

Use OCELoT to download the test sets and to submit the translations.

After registering your team, please send a mail to Marion Di Marco (lastname -AT- cis.uni-muenchen.de) to verify your registration.

Please also refer to the news translation task for more details on the submission process.

When submitting the test sets, please select the respective entry starting with "unsup22" for unsupervised systems and "lowres22" for supervised systems from the "Test set" field.

(Many thanks to Tom Kocmi, Christian Federmann and Roman Grundkiewicz for making OCELoT work.)

OVERVIEW AUTOMATIC METRICS

This document contains a first overview of the BLEU and chrF2 scores of the primary submissions: primary-submissions-overview.pdf

We plan to update this document with more information.

EVALUATION

At present, we plan to use automatic metrics for the evaluation of this task. We believe that manual evaluation is less necessary for unsupervised and very low resource MT development, because automatic metrics have worked well at this (relatively low) level of translation quality in the past. We may reconsider this.
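To make the automatic metrics more concrete, here is a simplified, sentence-level sketch of a chrF-style score in Python. It assumes character n-grams of orders 1 to 6 with whitespace removed, precision and recall averaged over the orders, and an F-score with beta = 2 (the "2" in chrF2). It is for illustration only: it is not the organizers' scoring code, and its numbers may differ from those of standard evaluation tools, which should be used for any reported result.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # beta = 2 weights recall more heavily than precision
    return 100 * (1 + b2) * p * r / (b2 * p + r)
```

An identical hypothesis and reference score 100, and strings with no characters in common score 0. Because the score is built from character n-grams rather than whole words, partially correct word forms still earn credit.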

IMPORTANT DATES

Release of training/dev/test data 22nd June 2022
Release of blind test data 22nd August 2022
Translation submission deadline 29th August 2022
Paper submission deadline 7th September 2022

Please be aware that the translation submission deadline is very close to the paper submission deadline on 7th September!

ORGANIZERS

  • Alexander Fraser - CIS, LMU Munich
  • Marion Di Marco - CIS, LMU Munich
  • Marko Měškank - Witaj Sprachzentrum
  • Olaf Langner - Witaj Sprachzentrum
  • Hauke Bartels - Sorbian Institute
  • Marcin Szczepański - Sorbian Institute

Questions or comments can be posted for discussion at wmt-tasks@googlegroups.com.

Organizational issues can be directed to Marion Di Marco and Alexander Fraser.

ACKNOWLEDGMENTS

This work was supported by DFG (grant FR 2829/4-1).