natural language processing
A corpus of web crawl data composed of over 25 billion web pages.
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, et al.
C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
Dresden Web Table Corpus (DWTC) by Database Systems Group Dresden
Index to WARC Files and URLs in Columnar Format by Sebastian Nagel
Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, et al.
Large-scale graph mining with Spark by Win Suen
Learning word vectors for 157 languages by Facebook AI Research
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
Of using Common Crawl to play Family Feud by Paul Masurel
Using open data to predict market movements by DELL EMC
Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli
Resources on AWS
- Crawl data (WARC and ARC format)
- Resource type
- S3 Bucket
- Amazon Resource Name (ARN)
- AWS Region
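
The crawl data listed above is stored as WARC records. As a rough illustration of that format, the sketch below reads records from an uncompressed WARC stream using only the Python standard library. The function name `read_warc_records` and the sample record are hypothetical, made up for this example. The parser assumes single-line header fields and a `Content-Length` header on every record; it is not a complete implementation of the WARC specification.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Simplifying assumptions: header fields fit on one line each and every
    record carries a Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating consecutive records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical single-record sample, built in memory for illustration.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

WARC files are commonly distributed gzip-compressed; in that case, opening the file with `gzip.open(path, "rb")` and passing the resulting file object to the function should work, since Python's `gzip` module reads concatenated gzip members as one stream. For production use, a dedicated WARC library is likely a better choice than this sketch.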