Following the WMT18 shared task on parallel corpus filtering, we now pose the problem under more challenging low-resource conditions. Instead of German-English, this year the task covers two language pairs, Nepali-English and Sinhala-English.
Otherwise, the shared task follows the same set-up: given a noisy parallel corpus crawled from the web, participants develop methods to filter it down to a smaller set of high-quality sentence pairs.
We also provide links to training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance. We provide a script to fetch and compose the training data.
Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development set and a development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.
The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds: wrong language in source or target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.
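For illustration, one cheap symptom of such noise is an extreme length ratio between the two sides of a pair. The following sketch (file names and the 3:1 threshold are illustrative and not part of the official tools) emits one 0/1 value per sentence pair, aligned with the corpus:

```
# Hypothetical sketch: flag sentence pairs whose token-length ratio exceeds
# 3:1, a cheap symptom of misaligned or truncated pairs.  Emits one value per
# line, in corpus order.  File names are placeholders.
paste corpus.en corpus.ne | awk -F'\t' '{
  len_en = split($1, a, /[ ]+/); len_ne = split($2, b, /[ ]+/)
  print ((len_en > 0 && len_ne > 0 && len_en <= 3 * len_ne && len_ne <= 3 * len_en) ? 1 : 0)
}' > length-ratio-flags.txt
```

A real submission would of course combine several such signals; this only shows the mechanics of producing one score per sentence pair.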
| Milestone | Date |
| --- | --- |
| Release of raw parallel data | February 8, 2019 |
| Submission deadline for subsampled sets | May 10, 2019 |
| System descriptions due | May 17, 2019 |
| Announcement of results | June 3, 2019 |
| Paper notification | June 7, 2019 |
| Camera-ready for system descriptions | June 17, 2019 |
UPDATE: Download improved version of Nepali corpus (165M).
The provided tar ball contains the Nepali-English and Sinhala-English corpora in Moses format, i.e., one sentence per line, with corresponding lines in the English and foreign-language files.
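A quick way to inspect this format (file names follow the Sinhala example used later on this page and are only illustrative): both sides must have the same number of lines, and line N of the foreign file is paired with line N of the English file.

```
# Illustrative sanity check of the Moses-format pair files.
wc -l clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en
paste clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en | head -n 3
```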
Additional parallel data (Nepali-English):

| Corpus | Sentence pairs | English words | Source files | Comment |
| --- | --- | --- | --- | --- |
| Bible (two translations) | 61,645 | 1,507,905 | English.xml, English-WEB.xml, Nepali.xml | The extraction script can be found here |
| Global Voices | 2,892 | 75,197 | Global Voices (all) | Contains many languages; only use En-Ne |
| Penn Tree Bank | 4,199 | 88,758 | NepaliTaggedCorpus.zip | Corpus needs realigning; apply the patch found here |
| GNOME / KDE / Ubuntu | 494,994 | 2,018,631 | GNOME, KDE4, Ubuntu | |
| Nepali Dictionary | 9,916 | 25,058 | dictionaries.tar.gz | Link contains all languages |
Additional parallel data (Sinhala-English):

| Corpus | Sentence pairs | English words | Source files | Comment |
| --- | --- | --- | --- | --- |
| Open Subtitles | 601,164 | 3,594,769 | OPUS-OpenSubtitles18 | |
| GNOME / KDE / Ubuntu | 45,617 | 150,513 | GNOME, KDE4, Ubuntu | |
Monolingual data:

| Corpus | Sentences | Words | Source files |
| --- | --- | --- | --- |
| Filtered Sinhala Wikipedia | 155,946 | 4,695,602 | wikipedia.si_filtered.gz |
| Filtered Nepali Wikipedia | 92,296 | 2,804,439 | wikipedia.ne_filtered.gz |
| Filtered English Wikipedia | 67,796,935 | 1,985,175,324 | wikipedia.en_filtered.gz |
| Filtered Sinhala Common Crawl | 5,178,491 | 110,270,445 | commoncrawl.deduped.si.xz |
| Filtered Nepali Common Crawl | 3,562,373 | 102,988,609 | commoncrawl.deduped.ne.xz |
| Filtered English Common Crawl | 380,409,891 | 8,894,266,960 | commoncrawl.deduped.en.xz |
Hindi-English data (Hindi is closely related to Nepali):

| Corpus | Sentences | Words | Source files |
| --- | --- | --- | --- |
| Parallel IITB Hindi-English Corpus | 1,492,827 | 20,667,240 | parallel.tgz |
| Monolingual IITB Hindi Corpus | 67,796,935 | 1,985,175,324 | monolingual.hi.tgz |
Upload the file to the Google Drive folder. Please clearly indicate your affiliation in the file name and send an email to phi@jhu.edu to announce your submission.
For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.
The development pack (dev-tools) includes the script subselect.perl, which allows you to subsample sets of 5 million and 1 million English tokens.
The syntax to use the script is:
subselect.perl FILE_SCORE FILE_F FILE_E OUT
This will typically look something like this for Sinhala-English:

subselect.perl my-score-file.txt clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en out

resulting in files with roughly the following properties:

% wc out.5000000*
  279503  5000052 25967107 out.5000000.en
  279503  3456614 41708480 out.5000000.si

For Nepali-English the stats are:

% wc out.5000000*
  248765  5000018 31748929 out.5000000.en
  248765  3327811 48824341 out.5000000.ne
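If you just want to exercise the subsampling pipeline before you have a real scoring method, a placeholder score file can be generated like this (a sketch only, assuming the score file holds one score per line in corpus order; random scores are of course meaningless as a filter):

```
# Sketch only: one random score per sentence pair, just to test the pipeline.
awk 'BEGIN { srand() } { print rand() }' clean-eval-wmt19-raw.en > my-score-file.txt
subselect.perl my-score-file.txt clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en out
```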
A statistical machine translation system can be trained on the subsampled data with Moses' experiment.perl. For detailed documentation on how to build machine translation systems with this script, please refer to the relevant Moses web page.
You will have to change the following configurations at the top of the ems-config.ne (or ems-config.si) configuration file, but everything else may stay the same. These settings are full path names:

- working-dir: a new directory in which experiment data will be stored
- moses-src-dir: directory that contains the Moses installation
- external-bin-dir: directory that contains the fast_align binaries
- cleaneval-data: directory that contains the development sets (dev-tools/dev-sets)
- my-corpus-stem: file name stem of the subsampled corpus (without the .si/.ne or .en extension)
You can then run the experiment with

$MOSES/scripts/ems/experiment.perl -config ems-config.ne -exec &> OUT &

and the resulting BLEU score is in the file evaluation/report.1.
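Since the command above backgrounds the run and redirects all of its output to OUT, a simple way to watch progress is to follow that log (a sketch, not part of the official tooling):

```
# Optional: watch the backgrounded EMS run via the redirected log file OUT.
tail -f OUT
# Once the run has finished, the BLEU score can be read from
# evaluation/report.1 under the experiment's working directory.
```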
To train a neural machine translation system, first prepare the data with

$DEV_TOOLS/nmt/prepare.sh LANGUAGE DIR SUBSET_STEM FLORES

where

- $DEV_TOOLS is the location of the provided dev-tools package (see above)
- LANGUAGE is either si or ne
- DIR is the directory where experimental data is stored
- SUBSET_STEM is the file name stem of the subsampled corpus (as for the SMT set-up above)
- FLORES is the checked-out directory of the FLoRes MT Benchmark from GitHub

The prepared data is stored in the DIR directory specified above.
You can then train a model with

$DEV_TOOLS/nmt/train.sh LANGUAGE

After training, you can test performance on the development test set with

$DEV_TOOLS/nmt/translate.sh LANGUAGE
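Putting the three steps together, an end-to-end run for Sinhala might look like the following sketch; the paths and the subset stem are placeholders, so adapt them to your own setup:

```
# Illustrative end-to-end NMT run for Sinhala; all paths are placeholders.
DEV_TOOLS=$HOME/dev-tools            # provided dev-tools package
DIR=$HOME/experiments/nmt-si         # where experimental data will be stored
FLORES=$HOME/flores                  # checked-out FLoRes repository
$DEV_TOOLS/nmt/prepare.sh si $DIR out.5000000 $FLORES
$DEV_TOOLS/nmt/train.sh si
$DEV_TOOLS/nmt/translate.sh si
```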
| Language | SMT (1 million) | NMT (1 million) | SMT (5 million) | NMT (5 million) |
| --- | --- | --- | --- | --- |
| Sinhala | 4.16 | 4.65 | 4.77 | 3.74 |
| Nepali | 3.40 | 5.23 | 4.22 | 1.85 |