This new shared task tackles the problem of cleaning noisy parallel corpora. Given a noisy parallel corpus (crawled from the web), participants develop methods to filter it down to a smaller set of high-quality sentence pairs.
Note that the task addresses the challenge of data quality, not the domain-relatedness of the data for a particular use case. We therefore discourage participants from subsampling the corpus for relevance to the news domain. For this reason, we place more emphasis on the second, undisclosed test set, although we will report both scores.
The provided raw parallel corpus is the outcome of a processing pipeline that aimed for high recall at the cost of precision, so it is very noisy. It exhibits noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
Release of raw parallel data | April 1, 2018 |
Submission deadline for subsampled sets | June 22, 2018 |
Announcement of results | July 9, 2018 |
System descriptions due | July 27, 2018 |
Camera-ready for system descriptions | August 31, 2018 |
The provided gzipped file contains three items per line, separated by TAB:
The Hunalign scores are reported by the sentence aligner. They may be a useful feature for sentence filtering, but they do not by themselves correlate strongly with sentence pair quality.
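Still, the Hunalign score can serve as a simple baseline score file for subsampling. The following is only a sketch under assumptions: the corpus file name is a placeholder, and the Hunalign score is assumed to sit in the third TAB-separated column of the gzipped file described above.

# Baseline sketch (hypothetical file name): reuse the Hunalign score as the quality score.
# Assumes the raw corpus is gzipped and TAB-separated, with the Hunalign score in column 3.
zcat corpus.gz | cut -f3 > my-score-file.txt

The resulting my-score-file.txt then contains one score per line of the raw corpus and can be passed to subsample.perl as FILE_SCORE (see below).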
Another useful feature may take the source of the data into account, e.g., by discounting sentence pairs that come from a web domain with generally low quality scores. To this end, we release the URL sources for each sentence pair as an additional data set.
Download URL sources for corpus (24 GB)
This data set is not deduplicated and contains the English and German URLs from which each sentence pair is derived. Note that a sentence pair that occurs in the raw corpus may occur multiple times in this set.
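For illustration, such a domain-level signal could be computed with standard command line tools. This is only a sketch under assumptions: the file name is a placeholder, the URL file is assumed to be gzipped and TAB-separated, and the English URL is assumed to be in the second column.

# Sketch: count how many sentence pairs come from each English web domain.
# Assumptions: gzipped, TAB-separated URL file (hypothetical name), English URL in column 2.
zcat corpus.url.gz \
  | cut -f2 \
  | sed -E 's|https?://([^/]+).*|\1|' \
  | sort | uniq -c | sort -rn > domain-counts.txt

Sentence pairs from domains with consistently low quality could then be discounted when producing the score file.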
Upload the file to the Google Drive folder. Please clearly indicate your affiliation in the file name, and send an email to phi@jhu.edu to announce your submission.
For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.
The development pack consists of the following components.

subsample.perl allows you to subsample sets with 10 million and 100 million English tokens.
The syntax to use the script is:
subsample.perl FILE_SCORE FILE_DE FILE_EN OUT
This will typically look something like

subsample.perl my-score-file.txt clean-eval-wmt18-raw.de clean-eval-wmt18-raw.en out

resulting in files with roughly the following properties:
% wc out.10000000*
  1227667   9597527   66933930 out.10000000.de
  1227667   9972470   60811790 out.10000000.en
  7787705  94170427  667930143 out.100000000.de
  7787705  99644243  612484232 out.100000000.en
 18030744 213384667 1408160095 total
A statistical machine translation system is trained and tested with experiment.perl. For detailed documentation on how to build machine translation systems with this script, please refer to the relevant Moses web page.
You will have to change the following settings at the top of the ems-config configuration file, but everything else may stay the same; an example excerpt is sketched after the list.
These settings are full path names:
working-dir: a new directory in which experiment data will be stored
moses-src-dir: directory that contains the Moses installation
external-bin-dir: directory that contains the fast align binaries
cleaneval-data: directory that contains the development sets (dev-tools/dev-sets)
my-corpus-stem: file name stem of the subsampled corpus (without .de or .en extensions)
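For illustration only, the top of the ems-config file might then look roughly as follows; all paths are placeholders, and the exact layout of the shipped configuration may differ.

[GENERAL]
# a new directory in which experiment data will be stored
working-dir = /path/to/working-dir
# directory that contains the Moses installation
moses-src-dir = /path/to/mosesdecoder
# directory that contains the fast align binaries
external-bin-dir = /path/to/fast_align/build
# directory that contains the development sets
cleaneval-data = /path/to/dev-tools/dev-sets
# file name stem of the subsampled corpus, without the .de or .en extension
my-corpus-stem = /path/to/out.10000000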
With these settings in place, run

$MOSES/scripts/ems/experiment.perl -config ems-config -exec &> OUT &

The resulting BLEU score is in the file evaluation/report.1.
For the neural machine translation setup, you will only have to change a number of settings in the file local-settings.sh (an example sketch follows the list):
GPU: GPU id used for training (e.g., 0)
mosesdecoder: your checkout of the Moses decoder (https://github.com/moses-smt/mosesdecoder/); it is not necessary to compile it
subword_nmt: your checkout of subword-nmt (https://github.com/rsennrich/subword-nmt)
marian: installation of Marian (https://github.com/marian-nmt/marian)
devset: directory that contains the development sets (dev-tools/dev-sets)
my_corpus_stem: file name stem of the subsampled corpus (without .de or .en extensions)
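As an illustration, and assuming local-settings.sh uses plain shell variable assignments, the settings might look like this (all paths are placeholders):

# local-settings.sh -- example values; all paths are placeholders
GPU=0                                  # GPU id used for training
mosesdecoder=/path/to/mosesdecoder     # checkout of the Moses decoder (no need to compile)
subword_nmt=/path/to/subword-nmt       # checkout of subword-nmt
marian=/path/to/marian                 # Marian installation
devset=/path/to/dev-tools/dev-sets     # development sets from the dev pack
my_corpus_stem=/path/to/out.10000000   # subsampled corpus, without the .de/.en extension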
With local-settings.sh in place, training a neural system and testing its performance on the development test set involves the execution of the following scripts:
./preprocess.sh: preparation of the training data (including tokenization, truecasing, and byte-pair encoding)
./train.sh: training a system (this may take several hours to days, depending on corpus size and type of GPU)
./test.sh: testing the system on newstest2017; the resulting BLEU score is in the file data/test.bleu
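Putting these steps together, a typical run might look like the following sketch; run times depend on corpus size and GPU.

./preprocess.sh      # tokenization, truecasing, byte-pair encoding
./train.sh           # train the neural system (hours to days)
./test.sh            # translate and score newstest2017
cat data/test.bleu   # inspect the resulting BLEU score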