Language Pair | Raw | Deduplicated | Clean |
--- | --- | --- | --- |
French-English | 21GB, 2.773 million segments, 11.843 billion English tokens | 7.9GB, 243 million segments, 2.870 billion English tokens | 1.2GB, 34 million segments, 443 million English tokens |
German-English | 24GB, 3.161 billion segments, 14.006 billion English tokens | 7.7GB, 264 million segments, 2.731 billion English tokens | 1.2GB, 36 million segments, 425 million English tokens |
Italian-English | 8.4GB, 1.190 billion segments, 5.283 billion English tokens | 3.1GB, 91 million segments, 1.088 billion English tokens | 539MB, 14 million segments, 188 million English tokens |
Russian-English | 3.9GB, 418 million segments, 1.7 billion English tokens | 1.5GB, 42 million segments, 460 million English tokens | 149MB, 2.8 million segments, 40 million English tokens |
Spanish-English | 8.6GB, 118 million segments, 5.157 billion English tokens | 3.7GB, 106 million segments, 1.300 billion English tokens | 1.2GB, 18 million segments, 232 million English tokens |
"Raw" and "Dedup" release of data includes URLs and a quality score (based on sentence alignment). Clean data is subset of Dedup with positive quality score.
Token count was computed with wc
on the raw untokenized text.
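As a minimal sketch of that filtering and counting step in Python: it assumes a tab-delimited layout of source URL, target URL, English segment, foreign segment, and quality score (the actual column order of the release may differ), and counts English tokens by whitespace splitting, mirroring `wc` on untokenized text.

```python
import sys

def filter_clean(lines, min_score=0.0):
    """Yield segment pairs whose alignment quality score is positive.

    Assumed tab-separated columns: source URL, target URL, English
    segment, foreign segment, quality score. Check the release notes
    for the actual layout.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed lines
        src_url, tgt_url, english, foreign, score = fields[:5]
        if float(score) > min_score:
            yield english, foreign

def count_english_tokens(pairs):
    """Whitespace token count on raw untokenized text, like `wc -w`."""
    return sum(len(english.split()) for english, _ in pairs)

if __name__ == "__main__":
    pairs = list(filter_clean(sys.stdin))
    print(len(pairs), "segment pairs,", count_english_tokens(pairs), "English tokens")
```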
Language Pair | File Size (xz) | Segment Pairs | Tokens (English) |
--- | --- | --- | --- |
French-English | 367M | 11,808,682 | 137,821,373 |
German-English | 374M | 12,169,115 | 136,793,414 |
Italian-English | 158M | 4,312,241 | 55,915,710 |
Russian-English | 34M | 772,571 | 12,344,705 |
Spanish-English | 188M | 5,508,752 | 68,419,109 |
Token counts were computed with `wc` on the raw, untokenized text.
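For spot-checking one of the xz-compressed files against the segment-pair and token counts above, a short Python sketch along the same lines; it assumes one tab-separated segment pair per line with the English side first, and the filename shown is a hypothetical placeholder.

```python
import lzma

def corpus_stats(path):
    """Count segment pairs and whitespace English tokens in an .xz file.

    Assumes one tab-separated pair per line, English side in the first
    column; the real column layout should be verified against the release.
    """
    segments = 0
    english_tokens = 0
    with lzma.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if not fields or not fields[0]:
                continue  # skip empty or malformed lines
            segments += 1
            # `wc`-style whitespace token count on the untokenized English side.
            english_tokens += len(fields[0].split())
    return segments, english_tokens

if __name__ == "__main__":
    seg, tok = corpus_stats("clean.en-fr.xz")  # hypothetical filename
    print(f"{seg:,} segment pairs, {tok:,} English tokens")
```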