This is the third instance of a shared task on assessing the quality of sentence pairs in a parallel corpus.
We also provide clean parallel and monolingual training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance.
Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development and development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.
The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
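To illustrate these noise types, the sketch below flags a few of them with cheap heuristics. This is not part of the official tooling, and real filtering methods rely on much stronger signals (such as the LASER similarity scores provided below); whitespace tokenization in particular is a crude assumption, especially for Khmer, which does not delimit words with spaces.

```python
def naive_noise_flags(src: str, tgt: str) -> list[str]:
    """Flag obvious noise in a sentence pair with cheap heuristics.
    Illustrative only; thresholds and tokenization are assumptions."""
    flags = []
    if not src.strip() or not tgt.strip():
        flags.append("empty")
    ls, lt = len(src.split()), len(tgt.split())
    if ls and lt and (ls / lt > 3 or lt / ls > 3):
        flags.append("length-ratio")  # likely incomplete or bad translation
    if src.strip() == tgt.strip():
        flags.append("identical")     # likely untranslated copy
    return flags
```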
This year, we also provide the document pairs from which the sentence pairs were extracted (using Hunalign and LASER). You may align sentences yourself from these document pairs, thus producing your own set of sentence pairs. If you opt to do this, you have to submit all aligned sentence pairs and their quality scores.
Release of raw parallel data | March 28, 2020 |
Submission deadline for subsampled sets | August 1, 2020 |
System descriptions due | August 15, 2020 |
Announcement of results | August 29, 2020 |
Paper notification | September 29, 2020 |
Camera-ready for system descriptions | October 10, 2020 |
Language Pair | Sentence Pairs | English tokens | Download | Baseline LASER scores |
Khmer-English | 4,169,574 | 58,347,212 | wmt20-sent.en-km.xz (201MB) | wmt20-sent.en-km.laser-score.xz (16MB) |
Pashto-English | 1,022,883 | 11,551,009 | wmt20-sent.en-ps.xz (45MB) | wmt20-sent.en-ps.laser-score.xz (3MB) |
The format of the parallel corpora is one sentence pair per line, with the English sentence and the Khmer/Pashto sentence separated by a TAB character.
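For instance, such a file can be read with a few lines of Python (a sketch, not part of the provided tools; the English sentence is in the first TAB field, matching the `cut -f 1` usage further below):

```python
def read_sentence_pairs(path):
    """Yield (english, other) tuples from a TAB-separated parallel file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            en, other = line.rstrip("\n").split("\t", 1)
            yield en, other
```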
Language Pair | Document Pairs | Download | Missing sentence pairs |
Khmer-English | 391,250 | wmt20-docs.en-km.xz (578MB) | wmt20-sent-missing-in-docs.en-km.xz (2.8MB) |
Pashto-English | 45,312 | wmt20-docs.en-ps.xz (88MB) | wmt20-sent-missing-in-docs.en-ps.xz (3.0MB) |
Unfortunately, the document pairs are missing for some of the sentence pairs included in the sentence-aligned set. So, if you are running your own document alignment, add the sentence pairs from the files wmt20-sent-missing-in-docs.en-km.xz and wmt20-sent-missing-in-docs.en-ps.xz to the sentence pairs that you extract yourself from the document pairs.
The format of the document pairs is one document pair per line, with four fields separated by a TAB character. The base64-encoded document fields can be decoded with

base64 -d < IN > OUT

The resulting Unicode text contains line breaks, but participants may apply additional sentence splitting.
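The same decoding can be done in Python; this minimal sketch (the function name is illustrative) turns one base64-encoded document field into its lines, which you could then pass to a sentence splitter:

```python
import base64

def decode_document(b64_field: str) -> list[str]:
    """Decode one base64-encoded TAB field back into the document's lines."""
    text = base64.b64decode(b64_field).decode("utf-8")
    return text.splitlines()
```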
Language Pair | Name | Sentence Pairs | English tokens | Comment |
Khmer-English: km-parallel.tgz (18MB)
GNOME | 56 | 233 | from OPUS, open source software localization |
GlobalVoices | 793 | 14,294 | from OPUS, citizen journalism |
KDE4 | 120,087 | 767,919 | from OPUS, open source software localization |
Tatoeba | 748 | 3,491 | from OPUS, crowd-sourced phrases |
Ubuntu | 6,987 | 27,413 | from OPUS, open source software localization |
Bible | 54,222 | 1,176,418 | alignment of 2 English with 4 Khmer Bibles |
JW300 | 107,156 | 1,827,348 | originally from OPUS, but re-done sentence alignment with Vecalign, religious texts |
Pashto-English: ps-parallel.tgz (2.1MB)
GNOME | 95,312 | 277,188 | from OPUS, open source software localization |
KDE4 | 3,377 | 8,881 | from OPUS, open source software localization |
Tatoeba | 31 | 239 | from OPUS, crowd-sourced phrases |
Ubuntu | 9,645 | 26,626 | from OPUS, open source software localization |
Bible | 13,432 | 298,522 | alignment of an English with a Pashto Bible |
TED Talks | 664 | 11,157 | created for this task, crawled from TED web site, sentence alignment with Vecalign |
Wikimedia | 737 | 37,566 | from OPUS, Wikipedia translations from Wikimedia foundation |
The corpora are broken up by type, and come in Moses format (two files, aligned sentences at the same line number).
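Reading a corpus in Moses format can be sketched as follows (a hypothetical helper, assuming the two files have the same number of lines, which the format guarantees):

```python
def read_moses(path_f, path_e):
    """Yield aligned (foreign, english) pairs from two line-aligned files."""
    with open(path_f, encoding="utf-8") as f, open(path_e, encoding="utf-8") as e:
        for src, tgt in zip(f, e):
            yield src.rstrip("\n"), tgt.rstrip("\n")
```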
Language | Corpus | Sentences | Download |
English | CommonCrawl | 1,806,450,728 | cc60_with_url_v2.en_XX_filtered.xz (72 GB) |
English | Wikipedia | 67,796,935 | wikipedia.en.lid_filtered.test_filtered.xz (3.2 GB) |
Pashto | CommonCrawl | 6,558,180 | cc60_with_url_v2.ps_AF_filtered.xz (277 MB) |
Pashto | Wikipedia | 76,557 | wikipedia.ps.lid_filtered.test_filtered.xz (5.9 MB) |
Khmer | CommonCrawl | 13,832,947 | cc60_with_url_v2.km_XX_filtered.xz (614 MB) |
Khmer | Wikipedia | 132,666 | wikipedia.km.lid_filtered.test_filtered.xz (12 MB) |
For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.
The development tools are packaged as dev-tools.tgz. Download and unpack the archive, and set the environment variable DEV_TOOLS to that directory, e.g.,

wget http://data.statmt.org/wmt20/filtering-task/dev-tools.tgz
tar xzf dev-tools.tgz
export DEV_TOOLS=`pwd`/dev-tools
The script subselect.perl allows you to subsample sets with 5 million English tokens. The syntax to use the script is:

subselect.perl FILE_SCORE FILE_F FILE_E OUT

This will typically look something like this for Pashto-English:

subselect.perl my-score-file.txt wmt20-sent.en-ps.ps wmt20-sent.en-ps.en subsample

resulting in files with roughly the following properties:
% wc subsample.5000000*
  225725  4979904 31063226 subsample.5000000.en
  225725   550988 44420879 subsample.5000000.ps

To try this on the provided LASER scores (this should result in the file sizes above), execute the following commands.
wget http://data.statmt.org/wmt20/filtering-task/wmt20-sent.en-ps.laser-score.xz
xz -d wmt20-sent.en-ps.laser-score.xz
wget http://data.statmt.org/wmt20/filtering-task/ps-km/wmt20-sent.en-ps.xz
xzcat wmt20-sent.en-ps.xz | cut -f 1 > wmt20-sent.en-ps.en
xzcat wmt20-sent.en-ps.xz | cut -f 2 > wmt20-sent.en-ps.ps
$DEV_TOOLS/subselect.perl wmt20-sent.en-ps.laser-score wmt20-sent.en-ps.ps wmt20-sent.en-ps.en subsample
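The core of the subsampling step can be approximated in Python as a greedy selection under a token budget. This is only a sketch: the actual subselect.perl may differ in details (for example, in how it treats the pair that crosses the budget, and in its tokenization of the English side).

```python
def subselect(scores, foreign, english, max_en_tokens=5_000_000):
    """Greedily keep the highest-scoring sentence pairs until the
    English side reaches the token budget (whitespace tokenization).
    Returns the kept pair indices in corpus order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, tokens = [], 0
    for i in order:
        n = len(english[i].split())
        if tokens + n > max_en_tokens:
            break  # simplification: stop at the first pair that overshoots
        tokens += n
        kept.append(i)
    return sorted(kept)
```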
Install fairseq and set the FAIRSEQ environment variable:

git clone https://github.com/pytorch/fairseq.git
cd fairseq
export FAIRSEQ=`pwd`

To train and test a system on subsampled data, first preprocess the data with
$DEV_TOOLS/train-from-scratch/prepare.sh LANGUAGE SYSTEM_DIR SUBSET_STEM

where

$DEV_TOOLS is the location of the provided dev-tools package (see above),
LANGUAGE is either km or ps,
SYSTEM_DIR is the directory where experimental data is stored, and
SUBSET_STEM is the file stem (without the language extensions) of the filtered parallel corpus.

Then start training in the SYSTEM_DIR directory specified above:
cd $SYSTEM_DIR
bash train.sh

After training, you can test performance on the development test set with
cd $SYSTEM_DIR
bash translate.sh

Here is the sequence of commands for the example corpus:
$DEV_TOOLS/train-from-scratch/prepare.sh ps example-system subsample.5000000
cd example-system
bash train.sh
bash translate.sh
The evaluation via fine-tuning is faster and yields higher BLEU scores. To carry this out with the provided development tools (which include the Khmer and Pashto pre-trained MBART models), simply use the corresponding scripts in the directory train-mbart instead of train-from-scratch, e.g.,

$DEV_TOOLS/train-mbart/prepare-mbart.sh ps example-mbart-system subsample.5000000
cd example-mbart-system
bash train-mbart.sh
bash translate-mbart.sh
Language | Training from scratch | MBART fine-tuning |
Khmer | 7.1 | 10.4 |
Pashto | 9.6 | 12.2 |
We noticed that different GPU hardware yields different scores (±1 BLEU point) on these sets, and there is also some variance across random seeds. While you may observe different numbers, all final scoring will be done on identical hardware for all participants to ensure a fair assessment.
Upload the file to the Google Drive folder. Please indicate your affiliation clearly in the file name, and send an email to phi@jhu.edu to announce your submission.