commoncrawl_dedupe program. This program removes lines that:
df6fa1abb58549287111ba8d776733e9). Each deduped file is associated with an offset file, which records which byte range corresponds to which crawl ID (as in commoncrawl.org/the-data/get-started/). An offset file could look like this:
2012,2013_20 123456
2015_11 456789
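This means that bytes 0-123456 contain the crawls 2012 and 2013_20, and bytes 123456-456789 contain the crawl 2015_11. Each line lists one or more comma-separated crawl IDs followed by the byte offset at which their range ends; the range starts where the previous one ended (or at 0 for the first line).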
NOTE: In some offset files the crawl ID 2014_1 shows up. This refers to the file ${lang_code}.2014_1.raw.xz from ngrams/raw and not to a specific CommonCrawl crawl ID.
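As an illustration of the offset file format described above, the following is a minimal Python sketch that turns the offset lines into explicit byte ranges. It is not part of the original tooling: the function name parse_offsets is hypothetical, and treating each range as half-open (start inclusive, end exclusive) is an assumption based on the example, where one range ends at the same offset at which the next begins.

```python
from typing import List, Tuple


def parse_offsets(text: str) -> List[Tuple[List[str], int, int]]:
    """Convert offset-file lines into (crawl_ids, start, end) tuples.

    Assumption: each line is '<comma-separated crawl IDs> <end offset>',
    and a range starts where the previous one ended (0 for the first line).
    """
    ranges = []
    start = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: IDs on the left, offset on the right.
        crawl_ids, end_str = line.rsplit(None, 1)
        end = int(end_str)
        ranges.append((crawl_ids.split(","), start, end))
        start = end
    return ranges


example = "2012,2013_20 123456\n2015_11 456789\n"
for ids, start, end in parse_offsets(example):
    print(ids, start, end)
```

With ranges in this form, the bytes belonging to a given crawl can be read from the decompressed deduped file by seeking to start and reading end - start bytes, assuming the offsets refer to the uncompressed data.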