Language Pair | Raw | Deduplicated | Clean |
--- | --- | --- | --- |
French-English | 21GB, 2.773 million segments, 11.843 billion English tokens | 7.9GB, 243 million segments, 2.870 billion English tokens | 1.2GB, 34 million segments, 443 million English tokens |
German-English | 24GB, 3.161 billion segments, 14.006 billion English tokens | 7.7GB, 264 million segments, 2.731 billion English tokens | 1.2GB, 36 million segments, 425 million English tokens |
Italian-English | 8.4GB, 1.190 billion segments, 5.283 billion English tokens | 3.1GB, 91 million segments, 1.088 billion English tokens | 539MB, 14 million segments, 188 million English tokens |
Russian-English | 3.9GB, 418 million segments, 1.7 billion English tokens | 1.5GB, 42 million segments, 460 million English tokens | 149MB, 2.8 million segments, 40 million English tokens |
Spanish-English | 8.6GB, 118 million segments, 5.157 billion English tokens | 3.7GB, 106 million segments, 1.300 billion English tokens | 1.2GB, 18 million segments, 232 million English tokens |
"Raw" and "Dedup" release of data includes URLs and a quality score (based on sentence alignment). Clean data is subset of Dedup with positive quality score.
Token count was computed with wc
on the raw untokenized text.
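As a minimal sketch of that filtering and counting step in Python: it assumes a tab-delimited layout of source URL, target URL, English segment, foreign segment, and quality score (the actual column order of the release may differ), and counts English tokens by whitespace splitting, mirroring `wc` on untokenized text.

```python
import sys

def filter_clean(lines, min_score=0.0):
    """Yield segment pairs whose alignment quality score is positive.

    Assumed tab-separated columns: source URL, target URL, English
    segment, foreign segment, quality score. Check the release notes
    for the actual layout.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed lines
        src_url, tgt_url, english, foreign, score = fields[:5]
        if float(score) > min_score:
            yield english, foreign

def count_english_tokens(pairs):
    """Whitespace token count on raw untokenized text, like `wc -w`."""
    return sum(len(english.split()) for english, _ in pairs)

if __name__ == "__main__":
    pairs = list(filter_clean(sys.stdin))
    print(len(pairs), "segment pairs,", count_english_tokens(pairs), "English tokens")
```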
Language Pair | File Size (xz) | Segment Pairs | Tokens (English) |
--- | --- | --- | --- |
French-English | 367M | 11,808,682 | 137,821,373 |
German-English | 374M | 12,169,115 | 136,793,414 |
Italian-English | 158M | 4,312,241 | 55,915,710 |
Russian-English | 34M | 772,571 | 12,344,705 |
Spanish-English | 188M | 5,508,752 | 68,419,109 |
Token counts were computed with `wc` on the raw, untokenized text.
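For spot-checking one of the xz-compressed files against the segment-pair and token counts above, a short Python sketch along the same lines; it assumes one tab-separated segment pair per line with the English side first, and the filename shown is a hypothetical placeholder.

```python
import lzma

def corpus_stats(path):
    """Count segment pairs and whitespace English tokens in an .xz file.

    Assumes one tab-separated pair per line, English side in the first
    column; the real column layout should be verified against the release.
    """
    segments = 0
    english_tokens = 0
    with lzma.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if not fields or not fields[0]:
                continue  # skip empty or malformed lines
            segments += 1
            # `wc`-style whitespace token count on the untokenized English side.
            english_tokens += len(fields[0].split())
    return segments, english_tokens

if __name__ == "__main__":
    seg, tok = corpus_stats("clean.en-fr.xz")  # hypothetical filename
    print(f"{seg:,} segment pairs, {tok:,} English tokens")
```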