Deduplicated CommonCrawl Text

These files are produced from the raw files using the commoncrawl_dedupe program. This program removes lines that:

Have the same 64-bit MurmurHash as a previously seen line (i.e. it removes duplicates, but occasionally also removes a distinct line because of a hash collision).
Contain invalid UTF-8.
Indicate the URLs in the raw format (they begin with df6fa1abb58549287111ba8d776733e9).
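
The three rules above can be sketched in a few lines of Python. This is only an illustration, not the commoncrawl_dedupe source: the names dedupe_lines and hash64 are hypothetical, the order in which the checks run is an assumption, and because Python's standard library has no MurmurHash, a BLAKE2b digest truncated to 64 bits stands in for it.

```python
import hashlib

# Prefix that marks URL lines in the raw format (taken from the list above).
URL_MARKER = b"df6fa1abb58549287111ba8d776733e9"


def hash64(line: bytes) -> int:
    """Stand-in 64-bit hash. The real program uses MurmurHash; BLAKE2b
    truncated to 8 bytes is used here only to keep the sketch stdlib-only."""
    return int.from_bytes(hashlib.blake2b(line, digest_size=8).digest(), "big")


def dedupe_lines(lines):
    """Yield the lines that survive the three filters, in input order."""
    seen = set()  # 64-bit hashes of every line kept so far
    for line in lines:
        # Rule 3: drop lines that mark URLs in the raw format.
        if line.startswith(URL_MARKER):
            continue
        # Rule 2: drop lines that are not valid UTF-8.
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            continue
        # Rule 1: drop lines whose hash was already seen. Two different
        # lines with the same hash are treated as duplicates.
        h = hash64(line)
        if h in seen:
            continue
        seen.add(h)
        yield line
```

Storing only the 64-bit hash instead of the full line keeps memory use small on a very large corpus, which is why a rare collision can remove a line that was not actually a duplicate.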

Each deduped file is associated with an offset file, which records which byte ranges correspond to which crawl IDs (as in commoncrawl.org/the-data/get-started/). An offset file could look like this:

2012,2013_20 123456
2015_11 456789

This means that the bytes 0-123456 contain the crawls 2012 and 2013_20, and the bytes 123456-456789 contain the crawl 2015_11.
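
As a rough sketch, an offset file in this shape could be read as follows. The helper names parse_offsets and crawls_at are hypothetical, and two details are assumptions rather than documented behavior: that the crawl IDs and the offset are separated by whitespace, and that each range is half-open (the end offset of one range is the start of the next).

```python
def parse_offsets(text):
    """Turn offset-file text into a list of (start, end, crawl_ids) tuples.

    Each non-empty row is assumed to be '<id>[,<id>...] <end_offset>'.
    The first range starts at byte 0; each later range starts where the
    previous one ended.
    """
    ranges = []
    start = 0
    for row in text.splitlines():
        row = row.strip()
        if not row:
            continue  # tolerate blank rows
        ids, end = row.rsplit(None, 1)  # split off the trailing offset
        end = int(end)
        ranges.append((start, end, ids.split(",")))
        start = end
    return ranges


def crawls_at(ranges, byte_pos):
    """Return the crawl IDs covering byte_pos, or None if it is out of range."""
    for start, end, ids in ranges:
        if start <= byte_pos < end:  # half-open range (an assumption)
            return ids
    return None
```

With the example file above, crawls_at(parse_offsets(text), 200000) would return the list containing only 2015_11, because byte 200000 lies between 123456 and 456789.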

NOTE: In some offset files the Crawl ID 2014_1 shows up. This refers to the file ${lang_code}.2014_1.raw.xz from ngrams/raw and not to a specific CommonCrawl Crawl ID.

Available languages are:

If you want more, just ask us: crawl at kheafield.com.