Description
A corpus of web crawl data composed of over 50 billion web pages.
Update Frequency
Monthly
License
This data is available for anyone to use under the Common Crawl Terms of Use.
Documentation
https://commoncrawl.org/the-data/get-started/
Managed By
Common Crawl
Contact
https://commoncrawl.org/connect/contact-us/
How to Cite
Common Crawl was accessed on DATE from https://registry.opendata.aws/commoncrawl.
Usage Examples
Tutorials
Tools & Applications
Publications
- Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures by Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
- Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
- C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
- CC-News-En: A large English news corpus by Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
- CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
- Coyo-700m: Image-text pair dataset by Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim
- Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al.
- Index fun by Philippe Suter
- LAION-5B: An open large-scale dataset for training next generation image-text models by Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, et al.
- Language is not all you need: aligning perception with language models by Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, et al.
- Language models are few-shot learners by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al.
- Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
- LLaMA: open and efficient foundation language models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al.
- Mapping languages: The Corpus of Global Language Use by Jonathan Dunn
- mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, et al.
- Multimodal C4: an open, billion-scale corpus of images interleaved with text by Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, et al.
- N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
- No Language Left Behind: scaling human-centered machine translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, et al.
- Of using Common Crawl to play Family Feud by Paul Masurel
- On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
- Using open data to predict market movements by DELL EMC
- Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli