Following the WMT18 shared task on parallel corpus filtering, we now pose the problem under more challenging low-resource conditions. Instead of German-English, this year the task covers two language pairs, Nepali-English and Sinhala-English.
Otherwise, the shared task follows the same set-up: given a noisy parallel corpus crawled from the web, participants develop methods to filter it down to a smaller set of high-quality sentence pairs.
We also provide links to training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance. We provide a script to fetch and compose the training data.
Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development set and a development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.
The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds: wrong language in source or target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.
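For illustration, one cheap symptom of such noise is an extreme length ratio between the two sides of a pair. The following sketch (file names and the 3:1 threshold are illustrative and not part of the official tools) emits one 0/1 value per sentence pair, aligned with the corpus:

```
# Hypothetical sketch: flag sentence pairs whose token-length ratio exceeds
# 3:1, a cheap symptom of misaligned or truncated pairs.  Emits one value per
# line, in corpus order.  File names are placeholders.
paste corpus.en corpus.ne | awk -F'\t' '{
  len_en = split($1, a, /[ ]+/); len_ne = split($2, b, /[ ]+/)
  print ((len_en > 0 && len_ne > 0 && len_en <= 3 * len_ne && len_ne <= 3 * len_en) ? 1 : 0)
}' > length-ratio-flags.txt
```

A real submission would of course combine several such signals; this only shows the mechanics of producing one score per sentence pair.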
| Milestone | Date |
| --- | --- |
| Release of raw parallel data | February 8, 2019 |
| Submission deadline for subsampled sets | May 10, 2019 |
| System descriptions due | May 17, 2019 |
| Announcement of results | June 3, 2019 |
| Paper notification | June 7, 2019 |
| Camera-ready for system descriptions | June 17, 2019 |
UPDATE: Download improved version of Nepali corpus (165M).
The provided tar ball contains the Nepali-English and Sinhala-English corpora in Moses format, i.e., one sentence per line, with corresponding lines in the English and foreign-language files.
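A quick way to inspect this format (file names follow the Sinhala example used later on this page and are only illustrative): both sides must have the same number of lines, and line N of the foreign file is paired with line N of the English file.

```
# Illustrative sanity check of the Moses-format pair files.
wc -l clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en
paste clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en | head -n 3
```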
Additional parallel data (Nepali-English):

| Corpus | Sentence pairs | English words | Source files | Comment |
| --- | --- | --- | --- | --- |
| Bible (two translations) | 61,645 | 1,507,905 | English.xml, English-WEB.xml, Nepali.xml | The extraction script can be found here |
| Global Voices | 2,892 | 75,197 | Global Voices (all) | Contains many languages; only use En-Ne |
| Penn Tree Bank | 4,199 | 88,758 | NepaliTaggedCorpus.zip | Corpus needs realigning; apply the patch found here |
| GNOME / KDE / Ubuntu | 494,994 | 2,018,631 | GNOME, KDE4, Ubuntu | |
| Nepali Dictionary | 9,916 | 25,058 | dictionaries.tar.gz | Link contains all languages |
Additional parallel data (Sinhala-English):

| Corpus | Sentence pairs | English words | Source files | Comment |
| --- | --- | --- | --- | --- |
| Open Subtitles | 601,164 | 3,594,769 | OPUS-OpenSubtitles18 | |
| GNOME / KDE / Ubuntu | 45,617 | 150,513 | GNOME, KDE4, Ubuntu | |
Monolingual data:

| Corpus | Sentences | Words | Source files |
| --- | --- | --- | --- |
| Filtered Sinhala Wikipedia | 155,946 | 4,695,602 | wikipedia.si_filtered.gz |
| Filtered Nepali Wikipedia | 92,296 | 2,804,439 | wikipedia.ne_filtered.gz |
| Filtered English Wikipedia | 67,796,935 | 1,985,175,324 | wikipedia.en_filtered.gz |
| Filtered Sinhala Common Crawl | 5,178,491 | 110,270,445 | commoncrawl.deduped.si.xz |
| Filtered Nepali Common Crawl | 3,562,373 | 102,988,609 | commoncrawl.deduped.ne.xz |
| Filtered English Common Crawl | 380,409,891 | 8,894,266,960 | commoncrawl.deduped.en.xz |
Hindi-English data (Hindi is closely related to Nepali):

| Corpus | Sentences | Words | Source files |
| --- | --- | --- | --- |
| Parallel IITB Hindi-English Corpus | 1,492,827 | 20,667,240 | parallel.tgz |
| Monolingual IITB Hindi Corpus | 67,796,935 | 1,985,175,324 | monolingual.hi.tgz |
Upload the file to the Google Drive folder. Please clearly indicate your affiliation in the file name and send an email to phi@jhu.edu to announce your submission.
For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.
The development pack (dev-tools) includes the script subselect.perl, which allows you to subsample sets of 5 million and 1 million English tokens.
The syntax to use the script is:
subselect.perl FILE_SCORE FILE_F FILE_E OUT
This will typically look something like this for Sinhala-English:

subselect.perl my-score-file.txt clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en out

resulting in files with roughly the following properties:

% wc out.5000000*
  279503  5000052 25967107 out.5000000.en
  279503  3456614 41708480 out.5000000.si

For Nepali-English the stats are:

% wc out.5000000*
  248765  5000018 31748929 out.5000000.en
  248765  3327811 48824341 out.5000000.ne
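If you just want to exercise the subsampling pipeline before you have a real scoring method, a placeholder score file can be generated like this (a sketch only, assuming the score file holds one score per line in corpus order; random scores are of course meaningless as a filter):

```
# Sketch only: one random score per sentence pair, just to test the pipeline.
awk 'BEGIN { srand() } { print rand() }' clean-eval-wmt19-raw.en > my-score-file.txt
subselect.perl my-score-file.txt clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en out
```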
A statistical machine translation system can be trained on the subsampled data with Moses' experiment.perl. For detailed documentation on how to build machine translation systems with this script, please refer to the relevant Moses web page.
You will have to change the following configurations at the top of the ems-config.ne (or ems-config.si) configuration file, but everything else may stay the same. These settings are full path names:

- working-dir: a new directory in which experiment data will be stored
- moses-src-dir: directory that contains the Moses installation
- external-bin-dir: directory that contains the fast_align binaries
- cleaneval-data: directory that contains the development sets (dev-tools/dev-sets)
- my-corpus-stem: file name stem of the subsampled corpus (without the .si/.ne or .en extension)
You can then run the experiment with

$MOSES/scripts/ems/experiment.perl -config ems-config.ne -exec &> OUT &

and the resulting BLEU score is in the file evaluation/report.1.
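Since the command above backgrounds the run and redirects all of its output to OUT, a simple way to watch progress is to follow that log (a sketch, not part of the official tooling):

```
# Optional: watch the backgrounded EMS run via the redirected log file OUT.
tail -f OUT
# Once the run has finished, the BLEU score can be read from
# evaluation/report.1 under the experiment's working directory.
```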
To train a neural machine translation system, first prepare the data with

$DEV_TOOLS/nmt/prepare.sh LANGUAGE DIR SUBSET_STEM FLORES

where

- $DEV_TOOLS is the location of the provided dev-tools package (see above)
- LANGUAGE is either si or ne
- DIR is the directory where experimental data is stored
- SUBSET_STEM is the file name stem of the subsampled corpus (as for the SMT set-up above)
- FLORES is the checked-out directory of the FLoRes MT Benchmark from GitHub

The prepared data is stored in the DIR directory specified above.
You can then train a model with

$DEV_TOOLS/nmt/train.sh LANGUAGE

After training, you can test performance on the development test set with

$DEV_TOOLS/nmt/translate.sh LANGUAGE
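Putting the three steps together, an end-to-end run for Sinhala might look like the following sketch; the paths and the subset stem are placeholders, so adapt them to your own setup:

```
# Illustrative end-to-end NMT run for Sinhala; all paths are placeholders.
DEV_TOOLS=$HOME/dev-tools            # provided dev-tools package
DIR=$HOME/experiments/nmt-si         # where experimental data will be stored
FLORES=$HOME/flores                  # checked-out FLoRes repository
$DEV_TOOLS/nmt/prepare.sh si $DIR out.5000000 $FLORES
$DEV_TOOLS/nmt/train.sh si
$DEV_TOOLS/nmt/translate.sh si
```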
| Language | SMT (1 million) | NMT (1 million) | SMT (5 million) | NMT (5 million) |
| --- | --- | --- | --- | --- |
| Sinhala | 4.16 | 4.65 | 4.77 | 3.74 |
| Nepali | 3.40 | 5.23 | 4.22 | 1.85 |