Registry of Open Data on AWS

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with internet.

Search datasets (currently 13 matching datasets)

You are currently viewing a subset of data tagged with internet.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

Common Crawl

encyclopedic, internet, natural language processing, web archive

A corpus of web crawl data composed of over 50 billion web pages.

Details →

Usage examples

Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
Analysing Petabytes of Websites by Mark Litwintschik
C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
Search the html across 25 billion websites for passive reconnaissance using common crawl by Ryan Elkins
Index to WARC Files and URLs in Columnar Format by Sebastian Nagel

See 35 usage examples →

Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

analytics, broadband, cities, civic, disaster response, geospatial, global, government spending, infrastructure, internet, mapping, network traffic, parquet, regulatory, telecommunications, tiles

Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 Web Mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided both as Shapefiles and as Apache Parquet files, the latter with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
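The tile size quoted above can be checked with a little arithmetic. The following is a minimal sketch, not part of the dataset documentation: it assumes a spherical Earth and divides the equatorial circumference by the number of tiles across the map at a given zoom level. The function names and the two radius constants are chosen here for illustration. The 610.8 meter figure appears consistent with a mean Earth radius of 6,371 km; the WGS84 equatorial radius gives a slightly larger value.

```python
import math

# Assumption: spherical mean Earth radius, in metres.
MEAN_RADIUS_M = 6_371_000.0
# WGS84 semi-major (equatorial) axis, in metres, shown for comparison.
WGS84_EQUATORIAL_M = 6_378_137.0


def tile_edge_m(zoom: int, radius_m: float = MEAN_RADIUS_M) -> float:
    """Edge length in metres of one Web Mercator tile at the equator.

    At zoom level z the map is 2**z tiles wide, so each tile spans
    the equatorial circumference divided by 2**z.
    """
    return 2 * math.pi * radius_m / 2 ** zoom


def tile_edge_at_latitude_m(zoom: int, lat_deg: float,
                            radius_m: float = MEAN_RADIUS_M) -> float:
    """Approximate ground width of a tile at a given latitude.

    Web Mercator stretches distances by 1 / cos(latitude), so the
    ground distance covered by a tile shrinks by cos(latitude).
    """
    return tile_edge_m(zoom, radius_m) * math.cos(math.radians(lat_deg))


print(round(tile_edge_m(16), 1))                      # mean radius
print(round(tile_edge_m(16, WGS84_EQUATORIAL_M), 1))  # WGS84 radius
print(round(tile_edge_at_latitude_m(16, 60.0), 1))    # tile width at 60° N
```

Because tiles cover less ground away from the equator, per-tile averages at high latitudes describe smaller areas than the equatorial figure suggests.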

Details →

Usage examples

Launching Lonboard - A Python library for extremely fast geospatial vector data visualization in Jupyter by Kyle Barron
New Year, Great Data: The Best Ookla Open Data Projects We’ve Seen So Far by Katie Jolly
How to deliver performant GIS desktop applications with Amazon AppStream 2.0 by Ethan Fahy and Spencer DeBrosse
Bootstrapping Dask on 1000 cores with AWS Fargate by Imri Paran

See 4 usage examples →

DigitalCorpora

computer forensics, computer security, CSI, cyber security, digital forensics, image processing, imaging, information retrieval, internet, intrusion detection, machine learning, machine translation, text analysis

Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at...

Details →
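As an illustration of browsing the s3://digitalcorpora/ location mentioned above, here is a minimal sketch that uses only the Python standard library and the S3 REST ListObjectsV2 call. It assumes the bucket allows anonymous listing over HTTPS through the global endpoint, which the listing above does not confirm. The helper names (list_url, parse_keys, list_objects) are hypothetical and introduced here only for the example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(bucket: str, prefix: str = "", max_keys: int = 20) -> str:
    """Build a ListObjectsV2 URL for a bucket (virtual-hosted style)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_data):
    """Extract (key, size in bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(bucket: str, prefix: str = "", max_keys: int = 20):
    """Fetch one page of object keys.

    Assumption: the bucket permits unauthenticated listing.
    """
    with urllib.request.urlopen(list_url(bucket, prefix, max_keys),
                                timeout=30) as resp:
        return parse_keys(resp.read())


# Example (needs network access):
#   for key, size in list_objects("digitalcorpora", max_keys=5):
#       print(size, key)
```

Only the first page of results is fetched; a fuller client would follow the continuation token returned by S3, or use an SDK such as boto3 with unsigned requests.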

Usage examples

See 2 usage examples →

Common Screens

encyclopedic, internet, natural language processing

A corpus of web screenshots and metadata covering over 70 million websites.

Details →

Usage examples

IAB Text Classification by Common Screens

See 1 usage example →

End of Term Web Archive Dataset

archives, internet, natural language processing, web archive

The End of Term Web Archive (EOT) captures and saves U.S. Government websites at the end of presidential administrations. The EOT has thus far preserved websites from administration changes in 2008, 2012, 2016, and 2020. Data from these web crawls have been made openly available in several formats in this dataset.

Details →

Usage examples

Moving the End of Term Web Archive to the Cloud to Encourage Research Use and Reuse by Mark Phillips and Sawood Alam

See 1 usage example →

A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)

cyber security, internet, intrusion detection, network traffic

This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower-level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure incl...

Details →

MegaScenes

benchmark, computer vision, deep learning, internet

The MegaScenes Dataset is an extensive collection of around 430k scenes, featuring over 100k structure-from-motion reconstructions and over 2 million registered images. MegaScenes includes a diverse array of scenes, such as minarets, building interiors, statues, bridges, towers, religious buildings, and natural landscapes. The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.

Details →

Usage examples

MegaScenes: Scene-Level View Synthesis at Scale by Tung J., Chou G., Cai R., Yang, G., Zhang K., Wetzstein G., et al.

See 3 usage examples →

Open Observatory of Network Interference (OONI)

internet

A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.

Details →

SUCHO Ukrainian Cultural Heritage Web Archives

cultural preservation, internet, ukraine

The dataset contains web archives of Open Access collections of digitised cultural heritage from more than 3,000 websites of Ukrainian cultural institutions, such as museums, libraries, and archives. The web archives were produced by SUCHO, a volunteer group of more than 1,300 international cultural heritage professionals (librarians, archivists, researchers, and programmers) who joined forces to save as much digitised cultural heritage as possible during the 2022 invasion of Ukraine, before the servers hosting it are destroyed, damaged, or go offline for any other reason. The web archives...

Details →

ABEJA CC JA

internet, japanese, natural language processing, web archive

A large Japanese-language corpus created by preprocessing Common Crawl data.

Details →

Usage examples

Tutorial of ABEJA CC JA dataset by Kyo Hattori
Building a Large-Scale Japanese Corpus from Common Crawl and Its Preprocessing by Kyo Hattori

See 2 usage examples →