There is no machine translation available for most of the ~7000 languages spoken on the planet Earth. This is because very limited or no parallel corpora are available. Research on unsupervised and very low resource machine translation is important for alleviating this problem. Unsupervised machine translation requires only monolingual data, while very low resource supervised machine translation uses very limited parallel data.
Like last year, the task includes building supervised and unsupervised translation systems for Upper and Lower Sorbian.
Lower and Upper Sorbian are Slavic minority languages spoken in the eastern part of Germany, with 7k and 30k native speakers respectively. The data for this task was provided by the Sorbian Institute (monolingual data) and the Witaj Sprachzentrum (Witaj Language Center) (both parallel and monolingual data).
All Upper/Lower Sorbian data (both monolingual and parallel) we release may be used.
In addition, we allow the use of all German, Czech and Polish data released for WMT. Parallel data with German on one side (German-Czech and German-Polish) from the WMT news tasks or available in the OPUS project might be used. No other language data may be used.
All Upper/Lower Sorbian data we release may be used, with the exception of the parallel Upper Sorbian ↔ Lower Sorbian corpus.
Furthermore, the German side of the parallel German ↔ Upper Sorbian and German ↔ Lower Sorbian training corpora may not be used.
In addition, we allow the use of all German, Czech and Polish data released for WMT. Parallel data with German on one side (German-Czech and German-Polish) from the WMT news tasks or available in the OPUS project might be used. No other language data may be used.
Development and development test data (2021): devtest.dsb-de.tgz
Development and development test data (2022): valid.de.gz, valid.dsb.gz
Please do not use the blind test data from last year.
Dev and test data (2020): devtest.tar.gz
Dev and test data (2022): HSB-DE_dev.tsv.gz
Please do not use the blind test data from the previous years.
Dev and test data (2022): devtest_dsb_hsb_2022.tar.gz
See the news translation task web page (also previous years) for additional data.
We updated the DE-HSB test set (August 24th). Please check that you use the correct version.
Use OCELoT to download the test sets and to submit the translations.
After registering your team, please send a mail to Marion Di Marco (lastname -AT- cis.uni-muenchen.de) to verify your registration.
Please also refer to the news translation task for more details on the submission process.
When submitting the test sets, please select the respective entry starting with "unsup22" for unsupervised systems and "lowres22" for supervised systems from the "Test set" field.
(Many thanks to Tom Kocmi, Christian Federmann and Roman Grundkiewicz for making OCELoT work.)
This document contains a first overview of the BLEU and chrF2 scores of the primary submissions: primary-submissions-overview.pdf
We plan to update this document with more information.

Release of training/dev/test data | 22nd June 2022 |
Release of blind test data | 22nd August 2022 |
Translation submission deadline | 29th August 2022 |
Paper submission deadline | 7th September 2022 |
Please be aware that the translation submission deadline is very close to the paper submission deadline on 7th September!
Questions or comments can be posted for discussion at wmt-tasks@googlegroups.com.
Organizational issues can be directed to Marion Di Marco and Alexander Fraser.
This work was supported by DFG (grant FR 2829/4-1).