Description
A corpus of web crawl data composed of over 50 billion web pages.
Update Frequency
Monthly
License
This data is available for anyone to use under the Common Crawl Terms of Use.
Documentation
https://commoncrawl.org/the-data/get-started/
Managed By
Common Crawl
Contact
https://commoncrawl.org/connect/contact-us/
How to Cite
Common Crawl was accessed on DATE from https://registry.opendata.aws/commoncrawl.
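The citation template above leaves DATE as a placeholder. A minimal sketch of filling it in programmatically follows; the helper name `format_citation` and the ISO 8601 date format are assumptions for illustration, since the registry does not prescribe a date format.

```python
from datetime import date

# URL taken verbatim from the citation template.
REGISTRY_URL = "https://registry.opendata.aws/commoncrawl"


def format_citation(accessed: date) -> str:
    """Fill the DATE placeholder of the citation template.

    Hypothetical helper: the YYYY-MM-DD format is an assumption,
    not something the registry specifies.
    """
    return (
        f"Common Crawl was accessed on {accessed.isoformat()} "
        f"from {REGISTRY_URL}."
    )


if __name__ == "__main__":
    # Print a citation stamped with today's date.
    print(format_citation(date.today()))
```

Any other unambiguous date format would serve equally well; ISO 8601 is used here only because it sorts cleanly and avoids day/month confusion.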
Usage Examples
Publications
- Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures by Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
- Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
- C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
- CC-News-En: A large English news corpus by Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
- CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
- Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al.
- Index fun by Philippe Suter
- Language models are few-shot learners by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al.
- Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
- Mapping languages: The Corpus of Global Language Use by Jonathan Dunn
- N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
- Of using Common Crawl to play Family Feud by Paul Masurel
- On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
- Using open data to predict market movements by DELL EMC
- Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli