This new shared task tackles the problem of cleaning noisy parallel corpora. Given a noisy parallel corpus (crawled from the web), participants develop methods to filter it down to a smaller set of high-quality sentence pairs.
Note that the task addresses the challenge of data quality, not the domain-relatedness of the data for a particular use case. We therefore discourage participants from subsampling the corpus for relevance to the news domain. For this reason, we place more emphasis on the second, undisclosed test set, although we will report both scores.
The provided raw parallel corpus is the outcome of a processing pipeline that aimed for high recall at the cost of precision, so it is very noisy. It exhibits noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
Release of raw parallel data | April 1, 2018 |
Submission deadline for subsampled sets | June 22, 2018 |
Announcement of results | July 9, 2018 |
System descriptions due | July 27, 2018 |
Camera-ready for system descriptions | August 31, 2018 |
The provided gzipped file contains three items per line, separated by TAB:
The Hunalign scores are reported by the sentence aligner. They may be a useful feature for sentence filtering, but they do not by themselves correlate strongly with sentence pair quality.
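Still, the Hunalign score can serve as a simple baseline score file for subsampling. The following is only a sketch under assumptions: the corpus file name is a placeholder, and the Hunalign score is assumed to sit in the third TAB-separated column of the gzipped file described above.

# Baseline sketch (hypothetical file name): reuse the Hunalign score as the quality score.
# Assumes the raw corpus is gzipped and TAB-separated, with the Hunalign score in column 3.
zcat corpus.gz | cut -f3 > my-score-file.txt

The resulting my-score-file.txt then contains one score per line of the raw corpus and can be passed to subsample.perl as FILE_SCORE (see below).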
Another useful feature may take the source of the data into account, e.g., by discounting sentence pairs that come from a web domain with generally low quality scores. To this end, we release the URL sources for each sentence pair as an additional data set.
Download URL sources for corpus (24 GB)
This data set is not deduplicated and contains the English and German URLs from which each sentence pair is derived. Note that a sentence pair that occurs in the raw corpus may occur multiple times in this set.
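For illustration, such a domain-level signal could be computed with standard command line tools. This is only a sketch under assumptions: the file name is a placeholder, the URL file is assumed to be gzipped and TAB-separated, and the English URL is assumed to be in the second column.

# Sketch: count how many sentence pairs come from each English web domain.
# Assumptions: gzipped, TAB-separated URL file (hypothetical name), English URL in column 2.
zcat corpus.url.gz \
  | cut -f2 \
  | sed -E 's|https?://([^/]+).*|\1|' \
  | sort | uniq -c | sort -rn > domain-counts.txt

Sentence pairs from domains with consistently low quality could then be discounted when producing the score file.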
Upload the file to the Google Drive folder. Please clearly indicate your affiliation in the file name, and send an email to phi@jhu.edu to announce your submission.
For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.
The development pack consists of the following components.

subsample.perl allows you to subsample sets with 10 million and 100 million English tokens.
The syntax to use the script is:
subsample.perl FILE_SCORE FILE_DE FILE_EN OUT
This will typically look something like

subsample.perl my-score-file.txt clean-eval-wmt18-raw.de clean-eval-wmt18-raw.en out

resulting in files with roughly the following properties:
% wc out.10000000*
  1227667   9597527   66933930 out.10000000.de
  1227667   9972470   60811790 out.10000000.en
  7787705  94170427  667930143 out.100000000.de
  7787705  99644243  612484232 out.100000000.en
 18030744 213384667 1408160095 total
A statistical machine translation system is trained and tested with experiment.perl. For detailed documentation on how to build machine translation systems with this script, please refer to the relevant Moses web page.
You will have to change the following settings at the top of the ems-config configuration file, but everything else may stay the same; an example excerpt is sketched after the list.
These settings are full path names:
working-dir: a new directory in which experiment data will be stored
moses-src-dir: directory that contains the Moses installation
external-bin-dir: directory that contains the fast align binaries
cleaneval-data: directory that contains the development sets (dev-tools/dev-sets)
my-corpus-stem: file name stem of the subsampled corpus (without .de or .en extensions)
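For illustration only, the top of the ems-config file might then look roughly as follows; all paths are placeholders, and the exact layout of the shipped configuration may differ.

[GENERAL]
# a new directory in which experiment data will be stored
working-dir = /path/to/working-dir
# directory that contains the Moses installation
moses-src-dir = /path/to/mosesdecoder
# directory that contains the fast align binaries
external-bin-dir = /path/to/fast_align/build
# directory that contains the development sets
cleaneval-data = /path/to/dev-tools/dev-sets
# file name stem of the subsampled corpus, without the .de or .en extension
my-corpus-stem = /path/to/out.10000000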
With these settings in place, run

$MOSES/scripts/ems/experiment.perl -config ems-config -exec &> OUT &

The resulting BLEU score is in the file evaluation/report.1.
For the neural machine translation setup, you will only have to change a number of settings in the file local-settings.sh (an example sketch follows the list):
GPU: GPU id used for training (e.g., 0)
mosesdecoder: your checkout of the Moses decoder (https://github.com/moses-smt/mosesdecoder/); it is not necessary to compile it
subword_nmt: your checkout of subword-nmt (https://github.com/rsennrich/subword-nmt)
marian: installation of Marian (https://github.com/marian-nmt/marian)
devset: directory that contains the development sets (dev-tools/dev-sets)
my_corpus_stem: file name stem of the subsampled corpus (without .de or .en extensions)
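As an illustration, and assuming local-settings.sh uses plain shell variable assignments, the settings might look like this (all paths are placeholders):

# local-settings.sh -- example values; all paths are placeholders
GPU=0                                  # GPU id used for training
mosesdecoder=/path/to/mosesdecoder     # checkout of the Moses decoder (no need to compile)
subword_nmt=/path/to/subword-nmt       # checkout of subword-nmt
marian=/path/to/marian                 # Marian installation
devset=/path/to/dev-tools/dev-sets     # development sets from the dev pack
my_corpus_stem=/path/to/out.10000000   # subsampled corpus, without the .de/.en extension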
With local-settings.sh in place, training a neural system and testing its performance on the development test set involves the execution of the following scripts:
./preprocess.sh: preparation of the training data (including tokenization, truecasing, and byte-pair encoding)
./train.sh: training a system (this may take several hours to days, depending on corpus size and type of GPU)
./test.sh: testing the system on newstest2017; the resulting BLEU score is in the file data/test.bleu
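Putting these steps together, a typical run might look like the following sketch; run times depend on corpus size and GPU.

./preprocess.sh      # tokenization, truecasing, byte-pair encoding
./train.sh           # train the neural system (hours to days)
./test.sh            # translate and score newstest2017
cat data/test.bleu   # inspect the resulting BLEU score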