Here is the data, in three forms: raw text, deduplicated text, and language models (LMs).
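The language models can be queried with KenLM. The following is a minimal sketch, assuming the kenlm Python module is installed and a model file has been downloaded locally; the filename en.lm.bin is a placeholder.

import kenlm

# Load a model in ARPA or KenLM binary format; "en.lm.bin" is a placeholder.
model = kenlm.Model("en.lm.bin")

sentence = "london is the capital of great britain"
print(model.score(sentence, bos=True, eos=True))  # total log10 probability
print(model.perplexity(sentence))                 # per-word perplexity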
For English, we provide the raw data in several files, sharded by a hash of each line so that identical lines end up in the same shard.
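The sharding scheme itself is simple to reproduce. The following is a minimal sketch; the shard count (100) and the choice of MD5 as the hash function are illustrative assumptions, not necessarily the settings used for the released files.

import hashlib
import sys

NUM_SHARDS = 100  # illustrative; not necessarily the number used for the release

def shard_of(line: str) -> int:
    # Hash the whole line; identical lines always map to the same shard.
    digest = hashlib.md5(line.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = [open(f"lines.{i:03d}.txt", "w", encoding="utf-8") for i in range(NUM_SHARDS)]
for line in sys.stdin:
    shards[shard_of(line.rstrip("\n"))].write(line)
for f in shards:
    f.close()

Because every distinct line lands in exactly one shard, the shards can be deduplicated independently and in parallel.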
The processing pipeline uses standard Moses tools and a truecasing model estimated on a large dataset. For English, Spanish, and French, the pipeline is available here; more languages will follow soon.
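As a rough illustration of the preprocessing (not the released pipeline itself), the sketch below chains the standard Moses tokenizer and truecaser from Python; the Moses installation path and the truecasing model filename are placeholders.

import subprocess

MOSES = "/path/to/mosesdecoder/scripts"  # placeholder path to the Moses scripts

with open("input.en.txt", "rb") as raw, open("output.en.tc.txt", "wb") as out:
    # Tokenize with the standard Moses tokenizer for English.
    tokenize = subprocess.Popen(
        ["perl", f"{MOSES}/tokenizer/tokenizer.perl", "-l", "en"],
        stdin=raw, stdout=subprocess.PIPE)
    # Apply a truecasing model; "truecase-model.en" is a placeholder filename.
    truecase = subprocess.Popen(
        ["perl", f"{MOSES}/recaser/truecase.perl", "--model", "truecase-model.en"],
        stdin=tokenize.stdout, stdout=out)
    tokenize.stdout.close()
    truecase.wait()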
When using parts of this work, please cite our LREC paper:
@inproceedings{Buck-commoncrawl,
  author    = {Christian Buck and Kenneth Heafield and Bas van Ooyen},
  title     = {N-gram Counts and Language Models from the Common Crawl},
  year      = {2014},
  month     = {May},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  address   = {Reykjavik, Iceland}
}
This work was supported by the MateCat project, which is funded by the EC under the 7th Framework Programme. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, we used Stampede, provided by the Texas Advanced Computing Center (TACC) at The University of Texas at Austin, under NSF XSEDE allocation TG-CCR140009. We also acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program through IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA or the US Government.