Paracrawl Benchmarks
This page links to the data sets used in the ACL 2020 paper
Paracrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz-Rojas, Leopoldo Pla, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza
Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
and contains instructions on how to evaluate methods for document alignment, sentence alignment, and sentence pair filtering.
For more on the Paracrawl project, consult its web site.
Document Alignment
The task of document alignment is defined as finding, for each non-English document, its English counterpart that contains the same content. Such a document may or may not exist. We restrict document alignment to documents from the same webdomain.
- German: download script, webdomain list (705 GB, 21,806 webdomains)
- Czech: download script, webdomain list (473 GB, 12,179 webdomains)
- Hungarian: download script, webdomain list (238 GB, 5,560 webdomains)
- Estonian: download script, webdomain list (210 GB, 5,129 webdomains)
- Maltese: download script, webdomain list (76 GB, 933 webdomains)
We do not provide any comparison of methods for this task, but simply used our default method, which machine-translates the foreign documents and matches them against English documents based on bags of n-grams.
Data format: The data is provided as a set of tar files, each containing one xz-compressed lett file per webdomain. Each line in a lett file contains one document, with the following TAB-separated items: (1) language, (2) format (typically text/html), (3) encoding (always charset=utf-8), (4) URL, (5) base64-encoded raw document (including, e.g., HTML markup), (6) base64-encoded extracted text. The extracted text is only partially sentence-split; additional sentence splitting is required.
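The six-field lett format described above can be parsed with a few lines of Python. This is an illustrative sketch, not project code; the function and field names are my own:

```python
import base64

def parse_lett_line(line):
    """Parse one TAB-separated line of a lett file into a dict.

    Fields follow the format described in the text: language, MIME
    format, encoding, URL, base64-encoded raw document, and
    base64-encoded extracted text.
    """
    lang, mime, enc, url, raw_b64, text_b64 = line.rstrip("\n").split("\t")
    return {
        "lang": lang,
        "mime": mime,
        "enc": enc,
        "url": url,
        # Raw document, e.g. the original HTML page.
        "html": base64.b64decode(raw_b64).decode("utf-8", errors="replace"),
        # Extracted (partially sentence-split) plain text.
        "text": base64.b64decode(text_b64).decode("utf-8", errors="replace"),
    }
```

To process a whole webdomain, decompress the xz file (or stream it with Python's lzma module) and apply this function line by line.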
Sentence Alignment
The task of sentence alignment is defined as finding correspondences between the sentences of an English document and those of a non-English document, such that each aligned pair expresses the same meaning. Typically, we aim to align a single English sentence with a single non-English sentence, but one-to-many, many-to-one, and many-to-many alignments are also valid.
- German (7.9 GB, 17,109,018 document pairs)
- Czech (4.2 GB, 6,661,650 document pairs)
- Hungarian (1.7 GB, 2,770,432 document pairs)
- Estonian (1.7 GB, 2,301,309 document pairs)
- Maltese (225 MB, 303,198 document pairs)
The paper provides a comparison of Hunalign, Vecalign, and Bleualign (using statistical and neural machine translation systems). The resulting sentence pair collections are provided in the next section.
Data format: Each line in the provided files contains one document pair, with the following TAB-separated items: (1) English URL, (2) non-English URL, (3) base64-encoded text of the English document, (4) base64-encoded text of the non-English document.
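Reading these files works the same way as for the lett format. A minimal sketch for streaming document pairs directly from the xz-compressed download (function name is my own; the file path is a placeholder):

```python
import base64
import lzma

def iter_document_pairs(path):
    """Yield (en_url, xx_url, en_text, xx_text) tuples from an
    xz-compressed document-pair file, one pair per line, with the
    two document texts base64-decoded."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            en_url, xx_url, en_b64, xx_b64 = line.rstrip("\n").split("\t")
            yield (
                en_url,
                xx_url,
                base64.b64decode(en_b64).decode("utf-8", errors="replace"),
                base64.b64decode(xx_b64).decode("utf-8", errors="replace"),
            )
```

Streaming keeps memory use flat even for the multi-gigabyte German and Czech files.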
Sentence Pair Filtering
The task of sentence pair filtering is defined as assigning a quality score to each sentence pair, so that a clean subset can be selected at some threshold value for the scores.
To evaluate such quality measures, we release parallel corpora that were extracted from the aligned documents using different sentence alignment methods. These corpora are subsequently deduplicated, keeping each sentence pair only once, no matter how often it occurs in the corpus.
For comparison purposes, we release quality scores computed with Zipporah, Bicleaner, and LASER.
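The deduplication step described above can be sketched in a few lines of Python. This is illustrative only; the function name and the order-preserving behavior are assumptions, not the project's actual implementation:

```python
def deduplicate(pairs):
    """Yield each (English, non-English) sentence pair only once,
    preserving the order of first occurrence, no matter how often
    the pair is repeated in the input."""
    seen = set()
    for en, xx in pairs:
        if (en, xx) not in seen:
            seen.add((en, xx))
            yield en, xx
```

Note that deduplication is on the exact pair: the same English sentence may still appear with several different non-English translations.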
Download script for all files (individual download links below)
- German Hunalign, deduplicated (133,939,165 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- German Vecalign, deduplicated (147,869,112 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- German Bleualign (NMT), deduplicated (15,381,743 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- German Bleualign (SMT), deduplicated (18,381,989 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Hunalign, deduplicated (57,121,723 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Vecalign, deduplicated (65,976,212 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Bleualign (NMT), deduplicated (4,817,308 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Bleualign (SMT), deduplicated (6,360,535 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Hunalign, deduplicated (24,646,784 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Vecalign, deduplicated (27,759,909 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Bleualign (NMT), deduplicated (1,802,035 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Bleualign (SMT), deduplicated (2,272,127 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Hunalign, deduplicated (22,813,150 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Vecalign, deduplicated (19,300,059 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Bleualign (NMT), deduplicated (2,129,038 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Bleualign (SMT), deduplicated (2,947,394 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Hunalign, deduplicated (2,053,620 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Vecalign, deduplicated (2,686,133 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Bleualign (NMT), deduplicated (393,783 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Bleualign (SMT), deduplicated (532,708 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
Data formats: The raw sentence-aligned files contain one sentence pair per line, with the following TAB-separated items: (1) English URL, (2) non-English URL, (3) English sentence, (4) non-English sentence, (5) optional quality score from the aligner. The deduplicated files contain only the English sentence and the non-English sentence, TAB-separated, with no pair repeated.
LASER scores are computed on a pre-filtered subset that excludes sentence pairs failing language identification or coverage-overlap detection. In the files above, the LASER scores are therefore released together with their sentence pairs, while the other scores are released on their own, matching the deduplicated sentence pair files line by line.
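Because the Zipporah and Bicleaner score files match the deduplicated pair files line by line, combining and thresholding them is a simple zip. A minimal sketch, with an illustrative function name and threshold:

```python
def filter_by_score(pairs, scores, threshold):
    """Combine a deduplicated sentence-pair stream with a score
    stream that matches it line by line, keeping pairs whose score
    is at or above the threshold."""
    for (en, xx), score in zip(pairs, scores):
        if score >= threshold:
            yield en, xx, score
```

In practice `pairs` would be read from a deduplicated file (two TAB-separated columns) and `scores` from the matching score file, one float per line.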
Evaluation Setup
The evaluation of the mining methods follows the protocol established by the WMT Shared Tasks on Parallel Corpus Filtering: subsample the cleanest parallel corpora according to a quality score, train machine translation systems on corpora of different sizes, and then assess their quality on a test set.
We provide scripts and test sets that allow you to:
- subsample the parallel corpus
- train a neural machine translation system for each subset
- score a test set with each system
Subsampling Script
Given a file with sentence-level quality scores, the script subselect.perl subsamples sets of 5 million English tokens.
The syntax to use the script is:
subselect.perl FILE_SCORE FILE_F FILE_E OUT
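The core idea of the subsampling step can be sketched as follows. This is a hypothetical Python reimplementation, not the actual script: it assumes subselect.perl ranks pairs by descending score and keeps them until the English side reaches the token budget, and it assumes whitespace tokenization; the real script may count tokens and break ties differently:

```python
def subselect(scores, src_sents, en_sents, token_budget=5_000_000):
    """Select the highest-scoring sentence pairs until the English
    side contains roughly `token_budget` whitespace-separated tokens.

    scores:    one quality score per sentence pair (FILE_SCORE)
    src_sents: non-English sentences (FILE_F)
    en_sents:  English sentences (FILE_E)
    """
    # Rank all pairs by quality score, best first.
    ranked = sorted(zip(scores, src_sents, en_sents),
                    key=lambda t: t[0], reverse=True)
    selected, tokens = [], 0
    for score, f, e in ranked:
        if tokens >= token_budget:
            break
        selected.append((f, e))
        tokens += len(e.split())
    return selected
```

The provided Perl script should be used for the actual benchmark runs so that subset boundaries match the published numbers.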
To try this on the provided Bicleaner scores (this should result in the file sizes shown below), execute the following commands.
wget paracrawl-benchmark.en-mt.hunalign.dedup.xz
xzcat paracrawl-benchmark.en-mt.hunalign.dedup.xz | cut -f 1 > paracrawl-benchmark.en-mt.hunalign.dedup.e
xzcat paracrawl-benchmark.en-mt.hunalign.dedup.xz | cut -f 2 > paracrawl-benchmark.en-mt.hunalign.dedup.f
wget paracrawl-benchmark.en-mt.hunalign.dedup.bicleaner.xz
xz -d paracrawl-benchmark.en-mt.hunalign.dedup.bicleaner.xz
$DEV_TOOLS/subselect.perl \
paracrawl-benchmark.en-mt.hunalign.dedup.bicleaner \
paracrawl-benchmark.en-mt.hunalign.dedup.f \
paracrawl-benchmark.en-mt.hunalign.dedup.e \
subsample
For the Maltese-English Hunalign Bicleaner corpus, this results, for instance, in the following files:
% wc subsample.5000000*
245650 4985648 29374766 subsample.5000000.e
245650 4265140 31740888 subsample.5000000.f
Training a Neural Machine Translation System
To install fairseq, follow the instructions in the fairseq GitHub repository.
Once this is done, set the environment variable FAIRSEQ to the directory into which you cloned the repository, e.g.,
git clone https://github.com/pytorch/fairseq.git
cd fairseq
export FAIRSEQ=`pwd`
To train and test a system on subsampled data, first preprocess the data with
$DEV_TOOLS/prepare.sh LANGUAGE SYSTEM_DIR SUBSET_STEM
where
$DEV_TOOLS is the location of the provided dev-tools package (see above),
LANGUAGE is either de, cs, hu, et, or mt,
SYSTEM_DIR is the directory where experimental data is stored, and
SUBSET_STEM is the file stem (without the language extensions) of the filtered parallel corpus.
Then, train the system by executing the following command, with SYSTEM_DIR set as specified above.
bash $SYSTEM_DIR/train.sh
After training, you can test performance on the development test set with
bash $SYSTEM_DIR/translate.sh
Here is the sequence of commands for the example corpus:
$DEV_TOOLS/prepare.sh mt example-system subsample.5000000
bash example-system/train.sh
bash example-system/translate.sh
Evaluation of Sentence Pair Filtering
The evaluation of sentence pair filtering should take any of the sentence-aligned parallel corpora (deduplicated), assign a score to each sentence pair, and then proceed with the evaluation protocol.
Evaluation of Sentence Alignment
The evaluation of sentence alignment should take any of the document-aligned
corpus collections, extract sentence pairs from them, deduplicate the resulting
parallel corpus, score each sentence pair with Bitextor, and then proceed with the evaluation protocol.