Deduplicated CommonCrawl Text

These are processed from the raw files using the commoncrawl_dedupe program, which removes duplicate lines: each line is kept only the first time it is seen.
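
For illustration only, here is a minimal Python sketch of line-level deduplication; it is not the actual commoncrawl_dedupe implementation, which may normalize or filter lines in additional ways:

# Keep the first occurrence of every line, drop exact repeats.
import sys

seen = set()
for line in sys.stdin:
    line = line.rstrip("\n")
    if line not in seen:
        seen.add(line)
        sys.stdout.write(line + "\n")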

Each deduped file is associated with an offset file which records which byte ranges correspond to which crawl ID (crawl IDs as listed at commoncrawl.org/the-data/get-started/). An offset file could look like this:

2012,2013_20 123456
2015_11 456789

This means that bytes 0-123456 contain the 2012 and 2013_20 crawls, and bytes 123456-456789 contain the 2015_11 crawl.
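
As a sketch of how the offsets might be used, the following Python reads an offset file and pulls out the bytes belonging to one crawl. The file names are hypothetical, and whether the offsets index the compressed .xz file or the decompressed text should be checked against the data itself:

def read_offsets(path):
    # Map each crawl ID to the (start, end) byte range it occupies.
    # Each line lists comma-separated crawl IDs followed by the end offset
    # of their range; the start offset is the previous line's end (0 at first).
    ranges = {}
    start = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ids, end = line.rsplit(None, 1)
            end = int(end)
            for crawl_id in ids.split(","):
                ranges[crawl_id] = (start, end)
            start = end
    return ranges

ranges = read_offsets("en.deduped.offsets")   # hypothetical file name
start, end = ranges["2015_11"]
with open("en.deduped", "rb") as f:           # hypothetical file name
    f.seek(start)
    crawl_bytes = f.read(end - start)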

NOTE: In some offset files the crawl ID 2014_1 appears. This refers to the file ${lang_code}.2014_1.raw.xz from ngrams/raw, not to a specific CommonCrawl crawl ID.

Available languages are:

If you want more, just ask us: crawl at kheafield.com.