| Language Pair | Raw | Deduplicated | Clean |
| --- | --- | --- | --- |
| French-English | 21 GB, 2.773 billion segments, 11.843 billion English tokens | 7.9 GB, 243 million segments, 2.870 billion English tokens | 1.2 GB, 34 million segments, 443 million English tokens |
| German-English | 24 GB, 3.161 billion segments, 14.006 billion English tokens | 7.7 GB, 264 million segments, 2.731 billion English tokens | 1.2 GB, 36 million segments, 425 million English tokens |
| Italian-English | 8.4 GB, 1.190 billion segments, 5.283 billion English tokens | 3.1 GB, 91 million segments, 1.088 billion English tokens | 539 MB, 14 million segments, 188 million English tokens |
| Russian-English | 3.9 GB, 418 million segments, 1.7 billion English tokens | 1.5 GB, 42 million segments, 460 million English tokens | 149 MB, 2.8 million segments, 40 million English tokens |
| Spanish-English | 8.6 GB, 1.18 billion segments, 5.157 billion English tokens | 3.7 GB, 106 million segments, 1.300 billion English tokens | 1.2 GB, 18 million segments, 232 million English tokens |
"Raw" and "Dedup" release of data includes URLs and a quality score (based on sentence alignment). Clean data is subset of Dedup with positive quality score.
Token count was computed with wc on the raw untokenized text.
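As a minimal illustration of the Dedup-to-Clean step, the Python sketch below keeps only segment pairs with a positive quality score. The tab-separated layout with the score in the last column is an assumption for illustration; the actual column order of the release may differ.

```python
# Sketch: derive a "Clean"-style subset from a "Dedup"-style file by keeping
# only segment pairs whose quality score is positive.
# ASSUMPTION: tab-separated lines with the quality score as the last column;
# this layout is illustrative, not the documented release format.
import sys

def filter_clean(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            try:
                score = float(fields[-1])  # assumed: score is the last field
            except ValueError:
                continue  # skip malformed lines
            if score > 0:  # "Clean" = positive quality score only
                fout.write(line)

if __name__ == "__main__":
    filter_clean(sys.argv[1], sys.argv[2])
```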
| Language Pair | File Size (xz) | Segment Pairs | Tokens (English) |
| --- | --- | --- | --- |
| French-English | 367 MB | 11,808,682 | 137,821,373 |
| German-English | 374 MB | 12,169,115 | 136,793,414 |
| Italian-English | 158 MB | 4,312,241 | 55,915,710 |
| Russian-English | 34 MB | 772,571 | 12,344,705 |
| Spanish-English | 188 MB | 5,508,752 | 68,419,109 |
Token counts were again computed with `wc` on the raw, untokenized text.
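The `wc` count treats any whitespace-separated string as a token. The sketch below reproduces that count for the English side of a tab-separated file; the column index is an assumption for illustration, not the documented format.

```python
# Sketch: reproduce a wc-style token count (whitespace-separated words) for
# the English column of a tab-separated parallel file.
# ASSUMPTION: the English segment sits in column index 2; adjust to the
# actual layout of the release.
import sys

def count_english_tokens(path: str, english_col: int = 2) -> int:
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > english_col:
                # str.split() with no arguments splits on any whitespace,
                # matching `wc -w` on untokenized text
                total += len(fields[english_col].split())
    return total

if __name__ == "__main__":
    print(count_english_tokens(sys.argv[1]))
```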