Registry of Open Data on AWS

Usage examples for all datasets listed in the Registry of Open Data on AWS that are tagged with "internet".

Common Crawl

Tutorials

Analysing Petabytes of Websites by Mark Litwintschik
Amazon EMR
Common Crawl Index Athena by Edward Ross
Amazon Athena
Index to WARC Files and URLs in Columnar Format by Sebastian Nagel
Amazon Athena
Large-scale graph mining with Spark by Win Suen
One click to download all the web pages you may want by Jader Dias
Amazon Athena, AWS Lambda
Search the Common Crawl Using Lambda Functions by Andres Riancho
AWS Lambda

Tools & Applications

All Around The World: The Common Crawl Dataset - Attack Surface Research by Aliz Hammond
CCNet: Extracting high quality monolingual datasets from web crawl data by Facebook AI Research
Dresden Web Table Corpus (DWTC) by Database Systems Group Dresden
GloVe: Global vectors for word representation by Jeffrey Pennington, Richard Socher, Christopher D. Manning
Learning word vectors for 157 languages by Facebook AI Research
Ransacking your password reset tokens by Lukas Euler
Search the html across 25 billion websites for passive reconnaissance using common crawl by Ryan Elkins

Publications

Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures by Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
CC-News-En: A large English news corpus by Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
Coyo-700m: Image-text pair dataset by Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim
Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al
Index fun by Philippe Suter
LAION-5B: An open large-scale dataset for training next generation image-text models by Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, et al
Language is not all you need: aligning perception with language models by Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, et al
Language models are few-shot learners by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al
Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
LLaMA: open and efficient foundation language models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al
Mapping languages: The Corpus of Global Language Use by Jonathan Dunn
mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, et al
Multimodal C4: an open, billion-scale corpus of images interleaved with text by Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, et al
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
No Language Left Behind: scaling human-centered machine translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, et al
Of using Common Crawl to play Family Feud by Paul Masurel
On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
Using open data to predict market movements by DELL EMC
Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli

Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

Tutorials

Bootstrapping Dask on 1000 cores with AWS Fargate by Imri Paran
AWS Fargate, Amazon EKS
How to deliver performant GIS desktop applications with Amazon AppStream 2.0 by Ethan Fahy and Spencer DeBrosse
Amazon AppStream 2.0, Amazon RDS
Launching Lonboard - A Python library for extremely fast geospatial vector data visualization in Jupyter by Kyle Barron
New Year, Great Data: The Best Ookla Open Data Projects We’ve Seen So Far by Katie Jolly

DigitalCorpora

Publications

Common Screens

Tutorials

IAB Text Classification by Common Screens

End of Term Web Archive Dataset

Publications

Moving the End of Term Web Archive to the Cloud to Encourage Research Use and Reuse by Mark Phillips and Sawood Alam

MegaScenes

Tutorials

MegaScenes: Scene-Level View Synthesis at Scale by Tung J., Chou G., Cai R., Yang G., Zhang K., Wetzstein G., et al.

Tools & Applications

MegaScenes: Scene-Level View Synthesis at Scale by Tung J., Chou G., Cai R., Yang G., Zhang K., Wetzstein G., et al.

Publications

MegaScenes: Scene-Level View Synthesis at Scale by Tung J., Chou G., Cai R., Yang G., Zhang K., Wetzstein G., et al.

ABEJA CC JA

Tutorials

Tutorial of ABEJA CC JA dataset by Kyo Hattori

Publications

Building a Large-Scale Japanese Corpus from Common Crawl and Its Preprocessing by Kyo Hattori

If you want to add a dataset or usage example to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository or tell us about your project.