natural language processing
A corpus of web crawl data composed of over 25 billion web pages.
Tools & Applications
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, et al.
C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, et al.
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
Of using Common Crawl to play Family Feud by Paul Masurel
Using open data to predict market movements by DELL EMC
Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli
Resources on AWS
- Crawl data (WARC and ARC format)
- Resource type: S3 Bucket
- Amazon Resource Name (ARN)
- AWS Region
- AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
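The anonymous listing command above can also be driven from a script. The following is a minimal sketch, assuming the AWS CLI is installed and on the PATH. The helper names `build_ls_command` and `list_bucket` are hypothetical, introduced here for illustration only, and any key prefix passed in is the caller's assumption about the bucket layout, not something this page documents.

```python
import shlex
import subprocess

BUCKET = "commoncrawl"  # bucket name taken from the CLI command above


def build_ls_command(prefix=""):
    """Return the argument list for an anonymous `aws s3 ls` call.

    `prefix` is an optional key prefix inside the bucket (hypothetical
    parameter; the page itself only shows a listing of the bucket root).
    """
    # Drop leading slashes so "/foo/" and "foo/" produce the same URL.
    key_prefix = prefix.lstrip("/")
    return ["aws", "s3", "ls", f"s3://{BUCKET}/{key_prefix}", "--no-sign-request"]


def list_bucket(prefix=""):
    """Run the listing and return its stdout.

    Needs the AWS CLI and network access; raises CalledProcessError
    if the CLI exits with a non-zero status.
    """
    result = subprocess.run(
        build_ls_command(prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command rather than running it, so this works offline.
    print(shlex.join(build_ls_command()))
```

Building the argument list separately from running it keeps the `--no-sign-request` flag in one place and lets the command be inspected without network access.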