Paracrawl Benchmarks
This page links to the data sets used in the ACL 2020 paper
Paracrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz-Rojas, Leopoldo Pla, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza
Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
and contains instructions on how to evaluate methods for document alignment, sentence alignment, and sentence pair filtering.
For more on the Paracrawl project, consult its web site.
Document Alignment
The task of document alignment is defined as finding, for each non-English document, its English counterpart that contains the same content. Such a document may or may not exist. We restrict document alignment to documents from the same webdomain.
- German: download script, webdomain list (705 GB, 21,806 webdomains)
- Czech: download script, webdomain list (473 GB, 12,179 webdomains)
- Hungarian: download script, webdomain list (238 GB, 5,560 webdomains)
- Estonian: download script, webdomain list (210 GB, 5,129 webdomains)
- Maltese: download script, webdomain list (76 GB, 933 webdomains)
We do not provide any comparison of methods for this task, but simply used our default method, which machine-translates the foreign documents and matches them against English documents based on bags of n-grams.
Data format: The data is provided as a set of tar files, each containing one xz-compressed lett file per webdomain. Each line in a lett file contains one document, with the following TAB-separated items: (1) language, (2) format (typically text/html), (3) encoding (always charset=utf-8), (4) URL, (5) base64-encoded raw document (including, e.g., HTML markup), (6) base64-encoded extracted text. The extracted text is only partially sentence-split; additional sentence splitting is required.
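The six-field lett format described above can be parsed with a few lines of Python. This is an illustrative sketch, not project code; the function and field names are my own:

```python
import base64

def parse_lett_line(line):
    """Parse one TAB-separated line of a lett file into a dict.

    Fields follow the format described in the text: language, MIME
    format, encoding, URL, base64-encoded raw document, and
    base64-encoded extracted text.
    """
    lang, mime, enc, url, raw_b64, text_b64 = line.rstrip("\n").split("\t")
    return {
        "lang": lang,
        "mime": mime,
        "enc": enc,
        "url": url,
        # Raw document, e.g. the original HTML page.
        "html": base64.b64decode(raw_b64).decode("utf-8", errors="replace"),
        # Extracted (partially sentence-split) plain text.
        "text": base64.b64decode(text_b64).decode("utf-8", errors="replace"),
    }
```

To process a whole webdomain, decompress the xz file (or stream it with Python's lzma module) and apply this function line by line.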
Sentence Alignment
The task of sentence alignment is defined as finding correspondences between the sentences of an English document and those of a non-English document, such that each aligned pair expresses the same meaning. Typically, we aim to align a single English sentence with a single non-English sentence, but one-to-many, many-to-one, and many-to-many alignments are also valid.
- German (7.9 GB, 17,109,018 document pairs)
- Czech (4.2 GB, 6,661,650 document pairs)
- Hungarian (1.7 GB, 2,770,432 document pairs)
- Estonian (1.7 GB, 2,301,309 document pairs)
- Maltese (225 MB, 303,198 document pairs)
The paper provides a comparison of Hunalign, Vecalign, and Bleualign (using statistical and neural machine translation systems). The resulting sentence pair collections are provided in the next section.
Data format: Each line in the provided files contains one document pair, with the following TAB-separated items: (1) English URL, (2) non-English URL, (3) base64-encoded text of the English document, (4) base64-encoded text of the non-English document.
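Reading these files works the same way as for the lett format. A minimal sketch for streaming document pairs directly from the xz-compressed download (function name is my own; the file path is a placeholder):

```python
import base64
import lzma

def iter_document_pairs(path):
    """Yield (en_url, xx_url, en_text, xx_text) tuples from an
    xz-compressed document-pair file, one pair per line, with the
    two document texts base64-decoded."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            en_url, xx_url, en_b64, xx_b64 = line.rstrip("\n").split("\t")
            yield (
                en_url,
                xx_url,
                base64.b64decode(en_b64).decode("utf-8", errors="replace"),
                base64.b64decode(xx_b64).decode("utf-8", errors="replace"),
            )
```

Streaming keeps memory use flat even for the multi-gigabyte German and Czech files.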
Sentence Pair Filtering
The task of sentence pair filtering is defined as assigning a quality score to each sentence pair, so that a clean subset can be selected at some threshold value for the scores.
To evaluate such quality measures, we release parallel corpora that were extracted from the aligned documents using different sentence alignment methods. These corpora are subsequently deduplicated, keeping each sentence pair only once, no matter how often it occurs in the corpus.
For comparison purposes, we release quality scores computed with Zipporah, Bicleaner, and LASER.
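The deduplication step described above can be sketched in a few lines of Python. This is illustrative only; the function name and the order-preserving behavior are assumptions, not the project's actual implementation:

```python
def deduplicate(pairs):
    """Yield each (English, non-English) sentence pair only once,
    preserving the order of first occurrence, no matter how often
    the pair is repeated in the input."""
    seen = set()
    for en, xx in pairs:
        if (en, xx) not in seen:
            seen.add((en, xx))
            yield en, xx
```

Note that deduplication is on the exact pair: the same English sentence may still appear with several different non-English translations.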
Download script for all files (individual download links below)
- German Hunalign, deduplicated (133,939,165 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- German Vecalign, deduplicated (147,869,112 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- German Bleualign (NMT), deduplicated (15,381,743 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- German Bleualign (SMT), deduplicated (18,381,989 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Hunalign, deduplicated (57,121,723 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Vecalign, deduplicated (65,976,212 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Bleualign (NMT), deduplicated (4,817,308 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Czech Bleualign (SMT), deduplicated (6,360,535 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Hunalign, deduplicated (24,646,784 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Vecalign, deduplicated (27,759,909 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Bleualign (NMT), deduplicated (1,802,035 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Hungarian Bleualign (SMT), deduplicated (2,272,127 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Hunalign, deduplicated (22,813,150 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Vecalign, deduplicated (19,300,059 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Bleualign (NMT), deduplicated (2,129,038 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Estonian Bleualign (SMT), deduplicated (2,947,394 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Hunalign, deduplicated (2,053,620 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Vecalign, deduplicated (2,686,133 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Bleualign (NMT), deduplicated (393,783 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
- Maltese Bleualign (SMT), deduplicated (532,708 sentence pairs), with Zipporah, Bicleaner, and LASER scores.
Data formats: The raw sentence-aligned files contain one sentence pair per line, with the following TAB-separated items: (1) English URL, (2) non-English URL, (3) English sentence, (4) non-English sentence, (5) optional quality score from the aligner. The deduplicated files contain only the English sentence and the non-English sentence, TAB-separated, with no pair repeated.
LASER scores are computed on a pre-filtered subset that excludes sentence pairs failing language identification or coverage-overlap detection. In the files above, the LASER scores are therefore released together with their sentence pairs, while the other scores are released on their own, matching the deduplicated sentence pair files line by line.
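Because the Zipporah and Bicleaner score files match the deduplicated pair files line by line, combining and thresholding them is a simple zip. A minimal sketch, with an illustrative function name and threshold:

```python
def filter_by_score(pairs, scores, threshold):
    """Combine a deduplicated sentence-pair stream with a score
    stream that matches it line by line, keeping pairs whose score
    is at or above the threshold."""
    for (en, xx), score in zip(pairs, scores):
        if score >= threshold:
            yield en, xx, score
```

In practice `pairs` would be read from a deduplicated file (two TAB-separated columns) and `scores` from the matching score file, one float per line.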
Evaluation Setup
The evaluation of the mining methods follows the protocol established by the WMT Shared Tasks on Parallel Corpus Filtering: subsample the cleanest parallel corpora according to a quality score, train machine translation systems on corpora of different sizes, and then assess their quality on a test set.
We provide scripts and test sets that allow you to:
- subsample the parallel corpus
- train a neural machine translation system for each subset
- score a test set with each system
Subsampling Script
Given a file with sentence-level quality scores, the script subselect.perl subsamples sets of 5 million English tokens.
The syntax to use the script is:
subselect.perl FILE_SCORE FILE_F FILE_E OUT
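The core idea of the subsampling step can be sketched as follows. This is a hypothetical Python reimplementation, not the actual script: it assumes subselect.perl ranks pairs by descending score and keeps them until the English side reaches the token budget, and it assumes whitespace tokenization; the real script may count tokens and break ties differently:

```python
def subselect(scores, src_sents, en_sents, token_budget=5_000_000):
    """Select the highest-scoring sentence pairs until the English
    side contains roughly `token_budget` whitespace-separated tokens.

    scores:    one quality score per sentence pair (FILE_SCORE)
    src_sents: non-English sentences (FILE_F)
    en_sents:  English sentences (FILE_E)
    """
    # Rank all pairs by quality score, best first.
    ranked = sorted(zip(scores, src_sents, en_sents),
                    key=lambda t: t[0], reverse=True)
    selected, tokens = [], 0
    for score, f, e in ranked:
        if tokens >= token_budget:
            break
        selected.append((f, e))
        tokens += len(e.split())
    return selected
```

The provided Perl script should be used for the actual benchmark runs so that subset boundaries match the published numbers.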
To try this on the provided Bicleaner scores (this should result in the file sizes shown below), execute the following commands.
wget paracrawl-benchmark.en-mt.hunalign.dedup.xz
xzcat paracrawl-benchmark.en-mt.hunalign.dedup.xz | cut -f 1 > paracrawl-benchmark.en-mt.hunalign.dedup.e
xzcat paracrawl-benchmark.en-mt.hunalign.dedup.xz | cut -f 2 > paracrawl-benchmark.en-mt.hunalign.dedup.f
wget paracrawl-benchmark.en-mt.hunalign.dedup.bicleaner.xz
xz -d paracrawl-benchmark.en-mt.hunalign.dedup.bicleaner.xz
$DEV_TOOLS/subselect.perl \
paracrawl-benchmark.en-mt.hunalign.dedup.bicleaner \
paracrawl-benchmark.en-mt.hunalign.dedup.f \
paracrawl-benchmark.en-mt.hunalign.dedup.e \
subsample
For the Maltese-English Hunalign Bicleaner corpus, this results, for instance, in the following files:
% wc subsample.5000000*
245650 4985648 29374766 subsample.5000000.e
245650 4265140 31740888 subsample.5000000.f
Training a Neural Machine Translation System
To install fairseq, follow the instructions in the fairseq GitHub repository.
Once this is done, set the environment variable FAIRSEQ to the directory into which you cloned the repository, e.g.,
git clone https://github.com/pytorch/fairseq.git
cd fairseq
export FAIRSEQ=`pwd`
To train and test a system on subsampled data, first preprocess the data with
$DEV_TOOLS/prepare.sh LANGUAGE SYSTEM_DIR SUBSET_STEM
where
$DEV_TOOLS is the location of the provided dev-tools package (see above),
LANGUAGE is either de, cs, hu, et, or mt,
SYSTEM_DIR is the directory where experimental data is stored, and
SUBSET_STEM is the file stem (without the language extensions) of the filtered parallel corpus.
Then, train the system by executing the following command, with SYSTEM_DIR set as specified above.
bash $SYSTEM_DIR/train.sh
After training, you can test performance on the development test set with
bash $SYSTEM_DIR/translate.sh
Here is the sequence of commands for the example corpus:
$DEV_TOOLS/prepare.sh mt example-system subsample.5000000
bash example-system/train.sh
bash example-system/translate.sh
Evaluation of Sentence Pair Filtering
The evaluation of sentence pair filtering should take any of the sentence-aligned parallel corpora (deduplicated), assign a score to each sentence pair, and then proceed with the evaluation protocol.
Evaluation of Sentence Alignment
The evaluation of sentence alignment should take any of the document-aligned
corpus collections, extract sentence pairs from them, deduplicate the resulting
parallel corpus, score each sentence pair with Bitextor, and then proceed with the evaluation protocol.