This is the raw text obtained after splitting the crawl by language with Google's Compact Language Detector 2. The data is very noisy and contains a lot of boilerplate, but there is also a great deal of it.
The file format is a very simple plain-text format:
df6fa1abb58549287111ba8d776733e9 1.000000 url content ...
where df6fa1abb58549287111ba8d776733e9 is a magic number that marks the start of a new block. The second element is a number giving the index of the extracted block, which can be useful if blocks in different languages were found on a single page.
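The block structure described above can be read with a small streaming parser. The following is a minimal sketch, not the tooling used to produce the data. It assumes that a header line starts with the magic number followed by whitespace-separated index and URL fields, and that any remaining text on that line plus all following lines up to the next magic number form the content. The names `Block` and `iter_blocks` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple

# Magic number that marks the start of a new block (taken from the format description).
MAGIC = "df6fa1abb58549287111ba8d776733e9"


class Block(NamedTuple):
    index: float   # index of the extracted block on the page
    url: str       # source URL of the page
    content: str   # text belonging to this block


def iter_blocks(lines: Iterable[str]) -> Iterator[Block]:
    """Yield one Block per header line (assumed layout: magic, index, url, content)."""
    header: Optional[Tuple[float, str]] = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split(None, 3) if line.startswith(MAGIC + " ") else []
        if len(parts) >= 3:
            # A new header closes the previous block, if there was one.
            if header is not None:
                yield Block(header[0], header[1], "\n".join(body))
            header = (float(parts[1]), parts[2])
            # Assumption: text after the URL on the header line is content.
            body = [parts[3]] if len(parts) > 3 else []
        elif header is not None:
            body.append(line)
    # Emit the final block at end of input.
    if header is not None:
        yield Block(header[0], header[1], "\n".join(body))
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, which avoids loading a multi-terabyte file into memory.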
Since no tokenization or sentence splitting has been done at this point, we report the size in bytes:
LANG | Bytes (2012+2013) |
---|---|
en | 23,618,329,602,163 |
un | 4,367,264,610,469 |
de | 1,019,661,500,844 |
es | 986,863,932,133 |
fr | 912,158,947,446 |
ja | 577,139,792,775 |
ru | 537,364,268,817 |
pl | 334,309,399,324 |
it | 325,579,516,606 |
pt | 316,872,603,278 |
zh | 264,912,799,026 |
nl | 207,899,143,023 |
cs | 139,682,978,056 |
tr | 138,255,666,018 |
sv | 130,421,180,734 |
ar | 109,914,506,835 |
ro | 100,825,300,518 |
fa | 90,374,032,287 |
id | 90,186,279,228 |
hu | 86,761,753,366 |
vi | 77,684,501,671 |
th | 71,970,958,807 |
el | 67,123,604,246 |
da | 63,059,532,993 |
fi | 47,727,938,111 |
zh-Hant | 46,967,117,486 |
no | 44,683,775,875 |
ko | 42,212,933,823 |
uk | 32,959,660,768 |
ms | 31,988,273,379 |
bg | 29,465,357,109 |
sk | 29,016,615,633 |
sr | 27,536,190,203 |
iw | 25,070,386,589 |
ca | 24,274,065,896 |
hr | 24,131,691,489 |
lt | 22,030,944,312 |
sl | 14,059,233,548 |
lv | 11,390,955,330 |
hi | 10,997,195,948 |
ta | 10,539,311,295 |
et | 10,270,827,759 |
la | 8,408,419,782 |
war | 8,102,391,410 |
is | 7,730,642,988 |
ka | 5,428,243,707 |
bn | 3,908,650,702 |
nn | 3,901,018,216 |
hy | 3,798,173,037 |
tl | 3,481,010,578 |
sq | 3,474,406,259 |
my | 3,297,902,869 |
eu | 3,285,427,626 |
gl | 3,107,623,692 |
ml | 2,863,670,088 |
tt | 2,240,062,962 |
te | 2,223,186,973 |
be | 2,207,971,182 |
af | 2,162,668,040 |
ne | 2,055,216,862 |
mk | 1,918,809,982 |
mr | 1,637,482,872 |
cy | 1,509,091,610 |
az | 1,421,450,552 |
ur | 1,406,684,538 |
si | 1,403,876,502 |
kn | 1,330,260,001 |
fy | 1,317,046,051 |
so | 1,315,073,394 |
mn | 1,140,723,230 |
vo | 1,116,516,465 |
gu | 1,110,807,207 |
eo | 1,096,331,719 |
sa | 979,279,954 |
kk | 891,043,735 |
mt | 761,846,156 |
km | 723,100,866 |
sco | 705,685,695 |
ga | 700,428,417 |
co | 697,347,610 |
sw | 650,679,225 |
mg | 630,391,941 |
uz | 564,116,186 |
ku | 534,084,333 |
pa | 456,340,297 |
ie | 435,674,188 |
lb | 416,320,874 |
am | 295,441,288 |
haw | 287,017,875 |
rw | 285,620,476 |
jw | 280,734,879 |
ps | 264,819,752 |
mi | 262,351,363 |
bo | 257,492,214 |
ia | 254,949,178 |
dv | 249,066,656 |
ceb | 247,746,154 |
ht | 225,690,983 |
zzp | 221,637,820 |
yi | 221,374,976 |
ug | 219,219,100 |
gn | 218,753,912 |
blu | 216,771,780 |
su | 205,803,323 |
br | 198,602,623 |
rm | 196,640,675 |
ha | 192,939,072 |
fo | 184,472,723 |
ky | 177,978,912 |
ln | 171,589,365 |
gd | 171,367,162 |
lo | 155,873,885 |
oc | 149,596,564 |
tg | 121,964,958 |
zu | 117,166,801 |
wo | 113,587,953 |
qu | 105,027,938 |
kl | 87,559,836 |
syr | 86,670,052 |
tk | 85,597,368 |
bh | 82,584,050 |
kha | 73,385,571 |
aa | 69,270,273 |
crs | 66,286,045 |
rn | 60,295,470 |
ba | 57,775,821 |
gv | 57,107,664 |
sm | 52,769,223 |
ny | 52,209,325 |
sn | 49,592,637 |
to | 48,187,233 |
xh | 47,329,488 |
mfe | 46,555,077 |
yo | 42,375,087 |
st | 41,788,647 |
sd | 40,701,508 |
dz | 38,504,390 |
ti | 37,127,554 |
or | 34,877,485 |
bi | 33,532,208 |
iu | 32,357,666 |
om | 29,092,598 |
fj | 28,270,544 |
lg | 25,703,715 |
ts | 25,281,385 |
ig | 20,390,649 |
chr | 20,241,181 |
tn | 17,192,310 |
ik | 16,941,479 |
na | 16,516,781 |
ss | 16,105,170 |
tlh | 13,364,626 |
as | 12,367,138 |
ab | 8,669,857 |
ay | 8,649,728 |
ak | 8,261,920 |
za | 4,545,052 |
nso | 2,075,133 |
sg | 2,070,307 |
ve | 1,787,627 |
ks | 1,445,926 |
bs | 1,262,415 |
lif | 428,511 |
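The raw byte counts are hard to compare at a glance. As a rough illustration, the sketch below parses a table row and converts the count into decimal units (1 kB = 1000 bytes). The helper names `parse_row` and `human_readable` are hypothetical and not part of any released tooling.

```python
from typing import Tuple


def parse_row(row: str) -> Tuple[str, int]:
    """Split a row such as 'en | 23,618,329,602,163 |' into (language code, bytes)."""
    cells = [cell.strip() for cell in row.split("|")]
    return cells[0], int(cells[1].replace(",", ""))


def human_readable(num_bytes: int) -> str:
    """Format a byte count using decimal (SI) units with two decimal places."""
    value = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{value:.2f} TB"  # not reached; kept for completeness


lang, size = parse_row("en | 23,618,329,602,163 |")
print(lang, human_readable(size))
```

Decimal units are used here only for readability; binary units (TiB) would give somewhat smaller figures for the same counts.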
When using parts of this work, please cite:
@inproceedings{Buck-commoncrawl,
author = {Christian Buck and Kenneth Heafield and Bas van Ooyen},
title = {N-gram Counts and Language Models from the Common Crawl},
year = {2014},
month = {May},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
address = {Reykjav\'{i}k, Iceland}
}