commoncrawl_dedupe program. This program removes lines that:
df6fa1abb58549287111ba8d776733e9).
Each deduped file is associated with an offset file, which records which byte ranges correspond to which crawl IDs (as in commoncrawl.org/the-data/get-started/). An offset file could look like this:
2012,2013_20 123456
2015_11 456789
This means that the bytes 0-123456 contain the crawls 2012 and 2013_20, and the bytes 123456-456789 contain the crawl 2015_11.
NOTE: In some offset files the crawl ID 2014_1 shows up. This refers to the file ${lang_code}.2014_1.raw.xz from ngrams/raw and not to a specific CommonCrawl crawl ID.
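
The offset-file layout described above can be turned into explicit byte ranges with a few lines of Python. This is only a sketch, not part of the original tooling: the function name `parse_offsets` is hypothetical, and it assumes that each line holds a comma-separated list of crawl IDs followed by whitespace and an end offset, that the first range starts at byte 0, and that each later range starts where the previous one ends.

```python
from typing import List, Tuple


def parse_offsets(text: str) -> List[Tuple[int, int, List[str]]]:
    """Turn offset-file text into (start, end, crawl_ids) tuples.

    Hypothetical helper: assumes "<id>[,<id>...] <end_offset>" per line.
    """
    ranges = []
    start = 0  # assumption: the first range begins at byte 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: IDs on the left, offset on the right.
        ids_field, end_field = line.rsplit(None, 1)
        end = int(end_field)
        ranges.append((start, end, ids_field.split(",")))
        start = end  # the next range starts where this one ends
    return ranges


example = "2012,2013_20 123456\n2015_11 456789\n"
for start, end, crawl_ids in parse_offsets(example):
    print(f"bytes {start}-{end}: {', '.join(crawl_ids)}")
# prints:
# bytes 0-123456: 2012, 2013_20
# bytes 123456-456789: 2015_11
```

The sketch only computes the ranges. Whether the end offset is inclusive, and which byte stream the offsets index into, should be checked against the actual files before slicing any data.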