About

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.




Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Sentinel-2

earth observation, satellite imagery, gis, natural resource

The Sentinel-2 mission is a land monitoring constellation of two satellites that provides high-resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth's land surface every 5 days, making the data of great use in ongoing studies. L1C data are available globally from June 2015. L2A data are available from April 2017 over the wider Europe region, with global expansion planned for July 2018.

Details →

Usage examples

See 14 usage examples →

Global Database of Events, Language and Tone (GDELT)

events

This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images, and events driving our global society every second of every day.

Details →

Usage examples

See 4 usage examples →

IRS 990 Filings

regulatory, statistics

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present.

Details →

Usage examples

See 4 usage examples →

Terrain Tiles

elevation, earth observation, gis

A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.

Details →

Usage examples

See 4 usage examples →

Common Crawl

encyclopedic, machine learning, internet

A corpus of web crawl data composed of over 5 billion web pages.

Details →

Usage examples

See 3 usage examples →

OpenStreetMap on AWS

mapping, osm

OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3.

Details →

Usage examples

See 3 usage examples →

MODIS on AWS

gis, satellite imagery, natural resource, sustainability

Select products from the Moderate Resolution Imaging Spectroradiometer (MODIS) managed by the U.S. Geological Survey and NASA.

Details →

Usage examples

See 2 usage examples →

NASA NEX

earth observation, natural resource, climate, sustainability

A collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth's surface.

Details →

Usage examples

See 2 usage examples →

New York City Taxi and Limousine Commission (TLC) Trip Record Data

cities, urban, transportation

Data of trips taken by taxis and for-hire vehicles in New York City.

Details →

Usage examples

See 2 usage examples →

Sentinel-1

earth observation, satellite imagery, gis

Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response to environmental disasters, and economic applications.

Details →

Usage examples

See 2 usage examples →

1000 Genomes

genetic, genomic, life sciences

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.

Details →

Usage examples

See 1 usage example →

Amazon Bin Image Dataset

computer vision, machine learning

The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.

Details →

Usage examples

See 1 usage example →

NAIP on AWS

earth observation, satellite imagery, gis, natural resource, regulatory

1-meter aerial imagery captured during the agricultural growing seasons in the continental U.S.

Details →

Usage examples

See 1 usage example →

OpenAQ

air quality, cities, environmental, gis, sustainability

Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines.

Details →

Usage examples

See 4 usage examples →

SpaceNet on AWS

gis, computer vision, machine learning, earth observation

A corpus of commercial satellite imagery and labeled training data to foster innovation in the development of computer vision algorithms.

Details →

Usage examples

See 1 usage example →

TCGA on AWS

cancer, genomic, life sciences

The Cancer Genome Atlas (TCGA) is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) to accelerate our understanding of the molecular basis of cancer. TCGA-funded researchers across the United States have produced a corpus of raw and processed genomic, transcriptomic, and epigenomic data from thousands of cancer patients.

Details →

Usage examples

See 1 usage example →

U.S. Census ACS PUMS

statistics, census, survey

U.S. Census Bureau American Community Survey (ACS) Public Use Microdata Sample (PUMS) available in a linked data format using the Resource Description Framework (RDF) data model.

Details →

Usage examples

See 1 usage example →

USAspending.gov

regulatory, economics, statistics, us

USAspending.gov database, which includes data on all spending by the federal government, including contracts, grants, loans, employee salaries, and more.

Details →

Usage examples

See 1 usage example →

3000 Rice Genomes Project

agriculture, food security, genetic, genomic, life sciences

The 3000 Rice Genomes Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.

Details →

Amazon Customer Reviews Dataset

natural language processing, information retrieval, machine learning

Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. More than 130 million customer reviews are available to researchers as part of this dataset.

Details →

CCAFS-Climate Data

agriculture, food security, climate, sustainability

High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.

Details →

Deutsche Börse Public Dataset

market data, financial markets, trading

The Deutsche Börse Public Data Set consists of trade data aggregated to one-minute intervals from the Eurex and Xetra trading engines. It provides the initial price, lowest price, highest price, final price, and volume for every minute of the trading day, and for every tradeable security. If you need higher-resolution data, including untraded price movements, please refer to our historical market data product. Also, be sure to check out our developer's portal.
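As a rough illustration of what aggregating trades to one-minute intervals involves, the sketch below builds per-minute bars (initial, lowest, highest, and final price plus volume) from a list of trades. The `minute_bars` function and the `(timestamp, price, volume)` tuple layout are assumptions made for illustration; they do not reflect the dataset's actual schema or the processing done by the Eurex and Xetra engines.

```python
from datetime import datetime


def minute_bars(trades):
    """Aggregate (timestamp, price, volume) trades into one-minute bars.

    Hypothetical helper: the tuple layout is an assumption for this
    sketch, not the published schema of the dataset.
    """
    bars = {}
    # Sort by time so "open" is the first and "close" the last trade per minute.
    for ts, price, volume in sorted(trades, key=lambda t: t[0]):
        minute = ts.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            bars[minute] = {"open": price, "low": price, "high": price,
                            "close": price, "volume": volume}
        else:
            bar["low"] = min(bar["low"], price)
            bar["high"] = max(bar["high"], price)
            bar["close"] = price
            bar["volume"] += volume
    return bars


# Made-up sample trades, used only to exercise the function.
trades = [
    (datetime(2018, 1, 2, 9, 0, 5), 100.0, 10),
    (datetime(2018, 1, 2, 9, 0, 40), 101.5, 5),
    (datetime(2018, 1, 2, 9, 0, 59), 99.5, 20),
    (datetime(2018, 1, 2, 9, 1, 3), 100.5, 7),
]
for minute, bar in minute_bars(trades).items():
    print(minute.isoformat(), bar)
```

Minutes without any trade simply produce no bar in this sketch, which is one reason untraded price movements need a higher-resolution source.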

Details →

District of Columbia - Classified Point Cloud LiDAR

gis, cities, us-dc

LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3. This dataset, managed by the Office of the Chief Technology Officer (OCTO), through the direction of the District of Columbia GIS program, contains tiled point cloud data for the entire District along with associated metadata.

Details →

EPA Risk-Screening Environmental Indicators

environmental

Detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model.

Details →

GOES on AWS

gis, weather, earth observation, meteorological, sustainability

GOES provides continuous weather imagery and monitoring of meteorological and space environment data across North America.

Details →

Genome in a Bottle on AWS

genomic, life sciences

Several reference genomes to enable translation of whole human genome sequencing to clinical practice.

Details →

Global Surface Summary of Day

environmental, climate, weather, natural resource, regulatory, sustainability

GSOD is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, and more) from 9000+ weather stations around the world.

Details →

Google Books Ngrams

natural language processing

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
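The sliding-window idea can be sketched in a few lines. This is a minimal illustration, assuming whitespace-separated words as tokens; the `ngrams` function name is hypothetical and this is not the pipeline that produced the dataset.

```python
def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (a sliding window)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = list(tokens)
    # The window advances one token at a time; each position emits one record.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Tokenizing by whitespace is an assumption made for this example.
words = "the quick brown fox jumps".split()
for gram in ngrams(words, 2):
    print(gram)
```

A text with fewer than n tokens yields no n-grams at all, so short fragments contribute nothing to the higher-order counts.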

Details →

HIRLAM Weather Model

earth observation, climate, weather, meteorological, sustainability

HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute.

Details →

ICGC on AWS

cancer, genomic, life sciences

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Details →

Multimedia Commons

computer vision, machine learning, multimedia

The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, so it is free for anyone to use for any purpose.

Details →

NOAA Global Forecast System (GFS) Model

climate, weather, environmental, sustainability

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The entire globe is covered by the GFS at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which is used by the operational forecasters who predict weather out to 16 days in the future. Horizontal resolution drops to 44 miles (70 kilometers) between grid points for forecasts between one week and two weeks.

Details →

NOAA High-Resolution Rapid Refresh (HRRR) Model

climate, weather, environmental, sustainability

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Details →

Nanopore Reference Human Genome

genomic, life sciences

This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.

Details →

Open Observatory of Network Interference

internet

A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.

Details →

OpenNeuro

biology, imaging, neurobiology, neuro imaging

OpenNeuro is a database of openly available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenfMRI resource has been funded by the National Science Foundation, National Institute on Drug Abuse, and the Laura and John Arnold Foundation.

Details →

OpenStreetMap Linear Referencing

gis, traffic, osm

OSMLR is a linear referencing system built on top of OpenStreetMap. OSM has great information about roads around the world and their interconnections, but it lacks the means to give a stable identifier to a stretch of roadway. OSMLR provides a stable set of numerical IDs for every 1-kilometer stretch of roadway around the world. In urban areas, OSMLR IDs are attached to each block of roadways between significant intersections.

Details →

PhysioNet

biology, life sciences

PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).

Details →

The Genome Modeling System

genetic, genomic, life sciences

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

Details →

The Human Connectome Project

neuro imaging, life sciences

The Human Connectome Project aims to provide an unparalleled compilation of neural data, an interface to graphically navigate this data and the opportunity to achieve never before realized conclusions about the living human brain.

Details →

The Human Microbiome Project

life sciences

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.

Details →

UK Met Office Global and Regional Weather Forecasts

earth observation, climate, weather, meteorological, sustainability

Archive data from the UK Met Office Global and Regional Ensemble Prediction System (MOGREPS) available on Amazon S3. Data from two models is available: MOGREPS-UK, a high-resolution weather forecast covering the United Kingdom, and MOGREPS-G, a global weather forecast.

Details →

CBERS on AWS

earth observation, gis, imaging, satellite imagery

This project creates an S3 repository with imagery acquired by the China-Brazil Earth Resources Satellite (CBERS). The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF format in order to optimize their use for cloud-based applications. Currently the repository contains all CBERS-4 MUX images acquired since the start of the CBERS-4 mission.

Details →

Usage examples

See 2 usage examples →

American Ninja Warrior Obstacle History

multimedia, events, sports

Obstacle history of American Ninja Warrior seasons 1-9. This dataset includes every obstacle in the history of American Ninja Warrior from season 1 to 9. This includes the obstacles at Sasuke (also known as the original Ninja Warrior in Japan) during seasons 1-3, when American Ninja Warrior (ANW) was on G4 and the top 10 competitors from the semi-finals round of ANW were sent to Sasuke to compete. Season 4 of ANW began what is known as the "NBC era", when the show took on the regional/city formats for both qualifying and semi-final rounds, with the finalists from each region competing at the National Finals of ANW in Las Vegas.

Details →

Usage examples

See 1 usage example →

Collection of daily coin data from Coin Metrics

financial markets, economics, bitcoin, blockchain

This project is set to pull the latest daily coin data from Coin Metrics using the data.world sync applet on IFTTT. Daily on-chain transaction volume is calculated as the sum of all transaction outputs belonging to the blocks mined on the given day. "Change" outputs are not included. The transaction count figure doesn’t include coinbase transactions.
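The two daily figures described above can be sketched as follows. This is a simplified illustration, not Coin Metrics' implementation: the block and transaction layout, the `is_change` flag (identifying change outputs is a heuristic in practice), and the `daily_volume_and_count` name are all assumptions. The sketch also assumes that coinbase outputs count toward volume, which the description does not state either way.

```python
def daily_volume_and_count(blocks):
    """Return (volume, tx_count) for the blocks mined on one day.

    Hypothetical layout: each block is a list of transactions; each
    transaction is a dict with "coinbase" (bool) and "outputs", a list
    of (value, is_change) pairs in the coin's smallest unit.
    """
    volume = 0
    tx_count = 0
    for block in blocks:
        for tx in block:
            # Sum every output except those flagged as change.
            volume += sum(value for value, is_change in tx["outputs"]
                          if not is_change)
            # Coinbase transactions are left out of the count.
            if not tx["coinbase"]:
                tx_count += 1
    return volume, tx_count


# Made-up sample data for two blocks mined on the same day.
blocks = [
    [
        {"coinbase": True, "outputs": [(50, False)]},
        {"coinbase": False, "outputs": [(30, False), (20, True)]},
    ],
    [
        {"coinbase": True, "outputs": [(50, False)]},
        {"coinbase": False, "outputs": [(5, False)]},
    ],
]
volume, tx_count = daily_volume_and_count(blocks)
print(volume, tx_count)
```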

Details →

Usage examples

See 1 usage example →

Federal Government Awards

census, government spending, regulatory, us

The Federal Awards dataset contains a complete export of the data available from USASpending. This dataset reflects all observations submitted through the third quarter of fiscal year 2017.

Details →

Usage examples

See 1 usage example →

Linguistic data of 32k film subtitles with IMDb meta-data

natural language processing, multimedia, media, movies, subtitles, sentiment, text analysis

Meta-data for more than 32 thousand films. These meta-data are matched to word-count categories from subtitle files.

Details →

Usage examples

See 1 usage example →

NFA 2017 - Ecological Resource Use and Resource Capacity of Nations from 1961 to 2013

environmental, climate, economics, life sciences, sustainability

Our National Footprint Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2013. The calculations in the National Footprint Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency.

Details →

Usage examples

See 1 usage example →

Translated Sacred Text Word Counts

natural language processing, machine learning

Counts of words used in English-language translations of sacred texts, with flag for common words.

Details →

Usage examples

See 1 usage example →

DigitalGlobe Open Data Program

earth observation, disaster response, gis, satellite imagery

Pre- and post-event high-resolution satellite imagery in support of emergency planning, risk assessment, monitoring of staging areas and emergency response, damage assessment, and recovery. Also includes crowdsourced damage assessments for major, sudden-onset disasters.

Details →