This is the third instance of a shared task on assessing the quality of sentence pairs in a parallel corpus.
We also provide clean parallel and monolingual training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance.
Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development and development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.
The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
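To illustrate these noise types, the sketch below flags a few of them with cheap heuristics. This is not part of the official tooling, and real filtering methods rely on much stronger signals (such as the LASER similarity scores provided below); whitespace tokenization in particular is a crude assumption, especially for Khmer, which does not delimit words with spaces.

```python
def naive_noise_flags(src: str, tgt: str) -> list[str]:
    """Flag obvious noise in a sentence pair with cheap heuristics.
    Illustrative only; thresholds and tokenization are assumptions."""
    flags = []
    if not src.strip() or not tgt.strip():
        flags.append("empty")
    ls, lt = len(src.split()), len(tgt.split())
    if ls and lt and (ls / lt > 3 or lt / ls > 3):
        flags.append("length-ratio")  # likely incomplete or bad translation
    if src.strip() == tgt.strip():
        flags.append("identical")     # likely untranslated copy
    return flags
```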
This year, we also provide the document pairs from which the sentence pairs were extracted (using Hunalign and LASER). You may align sentences yourself from these document pairs, thus producing your own set of sentence pairs. If you opt to do this, you have to submit all aligned sentence pairs and their quality scores.
Release of raw parallel data | March 28, 2020 |
Submission deadline for subsampled sets | August 1, 2020 |
System descriptions due | August 15, 2020 |
Announcement of results | August 29, 2020 |
Paper notification | September 29, 2020 |
Camera-ready for system descriptions | October 10, 2020 |
Language Pair | Sentence Pairs | English tokens | Download | Baseline LASER scores |
Khmer-English | 4,169,574 | 58,347,212 | wmt20-sent.en-km.xz (201MB) | wmt20-sent.en-km.laser-score.xz (16MB) |
Pashto-English | 1,022,883 | 11,551,009 | wmt20-sent.en-ps.xz (45MB) | wmt20-sent.en-ps.laser-score.xz (3MB) |
The format of the parallel corpora is one sentence pair per line, with the English sentence and the Khmer/Pashto sentence separated by a TAB character.
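For instance, such a file can be read with a few lines of Python (a sketch, not part of the provided tools; the English sentence is in the first TAB field, matching the `cut -f 1` usage further below):

```python
def read_sentence_pairs(path):
    """Yield (english, other) tuples from a TAB-separated parallel file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            en, other = line.rstrip("\n").split("\t", 1)
            yield en, other
```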
Language Pair | Document Pairs | Download | Missing sentence pairs |
Khmer-English | 391,250 | wmt20-docs.en-km.xz (578MB) | wmt20-sent-missing-in-docs.en-km.xz (2.8MB) |
Pashto-English | 45,312 | wmt20-docs.en-ps.xz (88MB) | wmt20-sent-missing-in-docs.en-ps.xz (3.0MB) |
Unfortunately, the document pairs are missing for some of the sentence pairs included in the sentence-aligned set. So, if you are running your own document alignment, add the sentence pairs from the files wmt20-sent-missing-in-docs.en-km.xz and wmt20-sent-missing-in-docs.en-ps.xz to the sentence pairs that you extract yourself from the document pairs.
The format of the document pairs is one document pair per line, with four fields separated by a TAB character. The base64-encoded document fields can be decoded with

base64 -d < IN > OUT

The resulting Unicode text contains line breaks, but participants may apply additional sentence splitting.
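The same decoding can be done in Python; this minimal sketch (the function name is illustrative) turns one base64-encoded document field into its lines, which you could then pass to a sentence splitter:

```python
import base64

def decode_document(b64_field: str) -> list[str]:
    """Decode one base64-encoded TAB field back into the document's lines."""
    text = base64.b64decode(b64_field).decode("utf-8")
    return text.splitlines()
```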
Language Pair | Name | Sentence Pairs | English tokens | Comment |
Khmer-English: km-parallel.tgz (18MB)
GNOME | 56 | 233 | from OPUS, open source software localization |
GlobalVoices | 793 | 14,294 | from OPUS, citizen journalism |
KDE4 | 120,087 | 767,919 | from OPUS, open source software localization |
Tatoeba | 748 | 3,491 | from OPUS, crowd-sourced phrases |
Ubuntu | 6,987 | 27,413 | from OPUS, open source software localization |
Bible | 54,222 | 1,176,418 | alignment of 2 English with 4 Khmer Bibles |
JW300 | 107,156 | 1,827,348 | originally from OPUS, but re-done sentence alignment with Vecalign, religious texts |
Pashto-English: ps-parallel.tgz (2.1MB)
GNOME | 95,312 | 277,188 | from OPUS, open source software localization |
KDE4 | 3,377 | 8,881 | from OPUS, open source software localization |
Tatoeba | 31 | 239 | from OPUS, crowd-sourced phrases |
Ubuntu | 9,645 | 26,626 | from OPUS, open source software localization |
Bible | 13,432 | 298,522 | alignment of an English with a Pashto Bible |
TED Talks | 664 | 11,157 | created for this task, crawled from TED web site, sentence alignment with Vecalign |
Wikimedia | 737 | 37,566 | from OPUS, Wikipedia translations from Wikimedia foundation |
The corpora are broken up by type, and come in Moses format (two files, aligned sentences at the same line number).
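Reading a corpus in Moses format can be sketched as follows (a hypothetical helper, assuming the two files have the same number of lines, which the format guarantees):

```python
def read_moses(path_f, path_e):
    """Yield aligned (foreign, english) pairs from two line-aligned files."""
    with open(path_f, encoding="utf-8") as f, open(path_e, encoding="utf-8") as e:
        for src, tgt in zip(f, e):
            yield src.rstrip("\n"), tgt.rstrip("\n")
```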
Language | Corpus | Sentences | Download |
English | CommonCrawl | 1,806,450,728 | cc60_with_url_v2.en_XX_filtered.xz (72 GB) |
English | Wikipedia | 67,796,935 | wikipedia.en.lid_filtered.test_filtered.xz (3.2 GB) |
Pashto | CommonCrawl | 6,558,180 | cc60_with_url_v2.ps_AF_filtered.xz (277 MB) |
Pashto | Wikipedia | 76,557 | wikipedia.ps.lid_filtered.test_filtered.xz (5.9 MB) |
Khmer | CommonCrawl | 13,832,947 | cc60_with_url_v2.km_XX_filtered.xz (614 MB) |
Khmer | Wikipedia | 132,666 | wikipedia.km.lid_filtered.test_filtered.xz (12 MB) |
For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.
The development tools are packaged as dev-tools.tgz. Download and unpack the archive, and set the environment variable DEV_TOOLS to that directory, e.g.,

wget http://data.statmt.org/wmt20/filtering-task/dev-tools.tgz
tar xzf dev-tools.tgz
export DEV_TOOLS=`pwd`/dev-tools
The script subselect.perl allows you to subsample sets with 5 million English tokens. The syntax to use the script is:

subselect.perl FILE_SCORE FILE_F FILE_E OUT

This will typically look something like this for Pashto-English:

subselect.perl my-score-file.txt wmt20-sent.en-ps.ps wmt20-sent.en-ps.en subsample

resulting in files with roughly the following properties:
% wc subsample.5000000*
  225725  4979904 31063226 subsample.5000000.en
  225725   550988 44420879 subsample.5000000.ps

To try this on the provided LASER scores (this should result in the file sizes above), execute the following commands.
wget http://data.statmt.org/wmt20/filtering-task/wmt20-sent.en-ps.laser-score.xz
xz -d wmt20-sent.en-ps.laser-score.xz
wget http://data.statmt.org/wmt20/filtering-task/ps-km/wmt20-sent.en-ps.xz
xzcat wmt20-sent.en-ps.xz | cut -f 1 > wmt20-sent.en-ps.en
xzcat wmt20-sent.en-ps.xz | cut -f 2 > wmt20-sent.en-ps.ps
$DEV_TOOLS/subselect.perl wmt20-sent.en-ps.laser-score wmt20-sent.en-ps.ps wmt20-sent.en-ps.en subsample
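The core of the subsampling step can be approximated in Python as a greedy selection under a token budget. This is only a sketch: the actual subselect.perl may differ in details (for example, in how it treats the pair that crosses the budget, and in its tokenization of the English side).

```python
def subselect(scores, foreign, english, max_en_tokens=5_000_000):
    """Greedily keep the highest-scoring sentence pairs until the
    English side reaches the token budget (whitespace tokenization).
    Returns the kept pair indices in corpus order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, tokens = [], 0
    for i in order:
        n = len(english[i].split())
        if tokens + n > max_en_tokens:
            break  # simplification: stop at the first pair that overshoots
        tokens += n
        kept.append(i)
    return sorted(kept)
```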
Install fairseq and set the FAIRSEQ environment variable:

git clone https://github.com/pytorch/fairseq.git
cd fairseq
export FAIRSEQ=`pwd`

To train and test a system on subsampled data, first preprocess the data with
$DEV_TOOLS/train-from-scratch/prepare.sh LANGUAGE SYSTEM_DIR SUBSET_STEM

where

$DEV_TOOLS is the location of the provided dev-tools package (see above),
LANGUAGE is either km or ps,
SYSTEM_DIR is the directory where experimental data is stored, and
SUBSET_STEM is the file stem (without the language extensions) of the filtered parallel corpus.

Then start training in the SYSTEM_DIR directory specified above:
cd $SYSTEM_DIR
bash train.sh

After training, you can test performance on the development test set with
cd $SYSTEM_DIR
bash translate.sh

Here is the sequence of commands for the example corpus:
$DEV_TOOLS/train-from-scratch/prepare.sh ps example-system subsample.5000000
cd example-system
bash train.sh
bash translate.sh
The evaluation via fine-tuning is faster and yields higher BLEU scores. To carry this out with the provided development tools (which include the Khmer and Pashto pre-trained MBART models), simply use the corresponding scripts in the directory train-mbart instead of train-from-scratch, e.g.,

$DEV_TOOLS/train-mbart/prepare-mbart.sh ps example-mbart-system subsample.5000000
cd example-mbart-system
bash train-mbart.sh
bash translate-mbart.sh
Language | Training from scratch | MBART fine-tuning |
Khmer | 7.1 | 10.4 |
Pashto | 9.6 | 12.2 |
We noticed that different GPU hardware yields different scores (±1 BLEU point) on these sets, and there is also some variance across random seeds. While you may observe different numbers, all final scoring will be done on identical hardware for all participants to ensure a fair assessment.
Upload the file to the Google Drive folder. Please indicate your affiliation clearly in the file name, and send an email to phi@jhu.edu to announce your submission.