Common Crawl

Tags: encyclopedic, internet, natural language processing, web archive

Description

A corpus of web crawl data composed of over 300 billion web pages.

Update Frequency

Monthly

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

https://commoncrawl.org/get-started

Managed By

Common Crawl

Contact

https://commoncrawl.org/contact-us

How to Cite

Common Crawl was accessed on DATE from https://registry.opendata.aws/commoncrawl.
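
The DATE placeholder in the citation above stands for the day the data was accessed. A minimal sketch of filling it in follows; the helper name format_citation is hypothetical, and the ISO 8601 date format is an assumption, since the template does not prescribe one.

```python
from datetime import date

# Registry URL exactly as given in the citation template.
REGISTRY_URL = "https://registry.opendata.aws/commoncrawl"


def format_citation(accessed: date) -> str:
    """Fill the DATE placeholder of the citation template."""
    # Assumption: YYYY-MM-DD; the template itself only says DATE.
    return (
        f"Common Crawl was accessed on {accessed.isoformat()} "
        f"from {REGISTRY_URL}."
    )


if __name__ == "__main__":
    # Print a citation stamped with today's date.
    print(format_citation(date.today()))
```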

Usage Examples

Tutorials

Analysing Petabytes of Websites by Mark Litwintschik (Amazon EMR)
Common Crawl Index Athena by Edward Ross (Amazon Athena)
Get To Know A Dataset - Common Crawl by Common Crawl Foundation
Index to WARC Files and URLs in Columnar Format by Sebastian Nagel (Amazon Athena)
Large-scale graph mining with Spark by Win Suen
One click to download all the web pages you may want by Jader Dias (Amazon Athena, AWS Lambda)
Search the Common Crawl Using Lambda Functions by Andres Riancho (AWS Lambda)

Tools & Applications

All Around The World: The Common Crawl Dataset - Attack Surface Research by Aliz Hammond
CCNet: Extracting high quality monolingual datasets from web crawl data by Facebook AI Research
Dresden Web Table Corpus (DWTC) by Database Systems Group Dresden
GloVe: Global vectors for word representation by Jeffrey Pennington, Richard Socher, Christopher D. Manning
Learning word vectors for 157 languages by Facebook AI Research
Ransacking your password reset tokens by Lukas Euler
Search the html across 25 billion websites for passive reconnaissance using common crawl by Ryan Elkins

Publications

Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures by Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
CC-News-En: A large English news corpus by Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
COYO-700M: Image-text pair dataset by Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim
Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al.
Index fun by Philippe Suter
LAION-5B: An open large-scale dataset for training next generation image-text models by Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, et al.
Language is not all you need: aligning perception with language models by Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, et al.
Language models are few-shot learners by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al.
Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
LLaMA: open and efficient foundation language models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al.
Mapping languages: The Corpus of Global Language Use by Jonathan Dunn
mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, et al.
Multimodal C4: an open, billion-scale corpus of images interleaved with text by Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, et al.
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
No Language Left Behind: scaling human-centered machine translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, et al.
Of using Common Crawl to play Family Feud by Paul Masurel
On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
Using open data to predict market movements by DELL EMC
Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli

Resources on AWS

Description

Crawl data (WARC and ARC format)

Resource type

S3 Bucket (AWS account required)

Amazon Resource Name (ARN)

arn:aws:s3:::commoncrawl

AWS Region

us-east-1

AWS CLI Access

aws s3 ls s3://commoncrawl/
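
The ARN, region, and CLI command above all point at the same bucket. As a rough sketch, the Python below derives the bucket name from the ARN and assembles the equivalent aws s3 ls invocation. The helper names bucket_from_arn and list_command are hypothetical, and actually running the command assumes the AWS CLI is installed and configured with credentials for an AWS account.

```python
import subprocess

# Values copied from the resource entry above.
ARN = "arn:aws:s3:::commoncrawl"
REGION = "us-east-1"


def bucket_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN (arn:aws:s3:::<bucket>)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "s3"] or not parts[5]:
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Drop any object key that follows the bucket name.
    return parts[5].split("/", 1)[0]


def list_command(arn: str, prefix: str = "", region: str = REGION) -> list[str]:
    """Build the argument list for `aws s3 ls` on the bucket named by the ARN."""
    uri = f"s3://{bucket_from_arn(arn)}/{prefix}"
    return ["aws", "s3", "ls", uri, "--region", region]


if __name__ == "__main__":
    # Assumption: the AWS CLI is on PATH and credentials are configured.
    subprocess.run(list_command(ARN), check=True)
```

Passing the command as an argument list rather than a shell string avoids quoting problems if a prefix contains spaces or special characters.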
