This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.
Get started using data quickly by viewing all tutorials with associated SageMaker Studio Lab notebooks.
See all usage examples for datasets listed in this registry.
See datasets from EPA, Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Data for Good at Meta, NASA Space Act Agreement, NIH STRIDES, NOAA Open Data Dissemination Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.
Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.
If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from ~15K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting. This data is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG...
encyclopedic, internet, natural language processing, web archive
A corpus of web crawl data composed of over 50 billion web pages.
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...
alchemical free energy calculations, biomolecular modeling, coronavirus, COVID-19, foldingathome, health, life sciences, molecular dynamics, protein, SARS-CoV-2, simulations, structural biology
Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. Run by the Folding@home Consortium, a worldwide network of research laboratories focusing on a variety of different diseases, Folding@home seeks to address problems in human health on a scale that is infeasible by any other means, sharing the results of these large-scale studies with the research community through peer-reviewed publications and publicly shared datasets. During the COVID-19 epidemic, Folding@home focused its resources on understanding the vulnerabilities in SARS-CoV-2, the virus that causes COVID-19 disease, and working closely with a number of experimental collaborators to accelerate progress toward effective therapies for treating COVID-19 and ending the pandemic. In the process, it created the world's first exascale distributed computing resource, enabling it to generate valuable scientific datasets of unprecedented size. More information about Folding@home's COVID-19 research activities is available at the Folding@home COVID-19 page. In addition to working directly with experimental collaborators and rapidly sharing new research findings through preprint servers, Folding@home has joined other researchers in committing to rapidly share all COVID-19 research data, and has joined forces with AWS and the Molecular Sciences Software Institute (MolSSI) to share datasets of unprecedented size through the AWS Open Data Registry, indexing these massive datasets via the MolSSI COVID-19 Molecular Structure and Therapeutics Hub. The complete index of all Folding@home datasets can be found here. Th...
cancer, genomic, life sciences, STRIDES, whole genome sequencing
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...
agriculture, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth's land surface every 5 days, making the data of great use in ongoing studies. L1C data are available globally from June 2015. L2A data are available from November 2016 over the Europe region and globally since January 2017.
agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
This joint NASA/USGS program provides the longest continuous space-based record of Earth’s land in existence. Every day, Landsat satellites provide essential information to help land managers and policy makers make wise decisions about our resources and our environment. Data is provided for Landsats 1, 2, 3, 4, 5, 7, 8, and 9 (excludes Landsat 6).
As of June 28, 2023 (announcement), the previous single SNS topic arn:aws:sns:us-west-2:673253540267:public-c2-notify was replaced with three new SNS topics for different types of scenes.
bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing
This dataset contains alignment files and short nucleotide, copy number (CNV), repeat expansion (STR), structural variant (SV) and other variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, and v4.2.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information). The v3.7.6 and v4.2.7 datasets include results from trio small variant, de novo structural vari...
biology, cell biology, cell imaging, Homo sapiens, image processing, life sciences, machine learning, microscopy
This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science. The types of data included in this bucket are listed below:
natural language processing
Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing. SudachiDict is the dictionary for a Japanese tokenizer (morphological analyzer) Sudachi. chiVe is Japanese pretrained word embeddings (word vectors), trained using the ultra-large-scale web corpus NWJC by National...
bioinformatics, cell biology, life sciences, single-cell transcriptomics, transcriptomics
CZ CELLxGENE Discover (cellxgene.cziscience.com) is a free-to-use platform for the exploration, analysis, and retrieval of single-cell data. CZ CELLxGENE Discover hosts the largest aggregation of standardized single-cell data from the major human and mouse tissues, with modalities that include gene expression, chromatin accessibility, DNA methylation, and spatial transcriptomics. This year, CZ CELLxGENE Discover has made available all of its human and mouse RNA single-cell data through Census (https://chanzuckerberg.github.io/cellxgene-census/) – a free-to-use service with an API and data that...
cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing
The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...
agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, weather
NEW GOES-19 Data!! On April 4, 2025 at 1500 UTC, the GOES-19 satellite will be declared the Operational GOES-East satellite. All products and services, including NODD, for GOES-East will transition to GOES-19 data at that time. GOES-19 will operate out of the GOES-East location of 75.2°W starting on April 1, 2025 and through the operational transition. Until the transition time and during the final stretch of Post Launch Product Testing (PLPT), GOES-19 products are considered non-operational regardless of their validation maturity level. Shortly following the transition of GOES-19 to GOES-East, all data distribution from GOES-16 will be turned off. GOES-16 will drift to the storage location at 104.7°W. GOES-19 data should begin flowing again on April 4th once this maneuver is complete.
NEW GOES 16 Reprocess Data!! The reprocessed GOES-16 ABI L1b data mitigates systematic data issues (including data gaps and image artifacts) seen in the Operational products, and improves the stability of both the radiometric and geometric calibration over the course of the entire mission life. These data were produced by recomputing the L1b radiance products from input raw L0 data using improved calibration algorithms and look-up tables, derived from data analysis of the NIST-traceable, on-board sources. In addition, the reprocessed data products contain enhancements to the L1b file format, including limb pixels and pixel timestamps, while maintaining compatibility with the operational products. The datasets currently available span the operational life of GOES-16 ABI, from early 2018 through the end of 2024. The Reprocessed L1b dataset shows improvement over the Operational L1b products but may still contain data gaps or discrepancies. Please provide feedback to Dan Lindsey (dan.lindsey@noaa.gov) and Gary Lin (guoqing.lin-1@nasa.gov). More information can be found in the [GOES-R ABI Reprocess User Guide](https://github.com/NOAA-Big-Data-Program/nodd-data-docs/blob/main/GOES/GOES-R_ABI_Reprocessed_L1b_User_Guide-v1.1.pdf).
NOTICE: As of January 10th 2023, GOES-18 assumed the GOES-West position and all data files are deemed both operational and provisional, so no ‘preliminary, non-operational’ caveat is needed. GOES-17 is now offline and has been shifted to approximately 105 degrees West, where it will be in on-orbit storage. GOES-17 data will no longer flow into the GOES-17 bucket. Operational GOES-West products can be found in the GOES-18 bucket.
GOES satellites (GOES-16, GOES-17, GOES-18 & GOES-19) provide continuous weather imagery and monitoring of meteorological and space environment data across North America.
GO...
agriculture, air quality, analytics, archives, atmosphere, climate, climate model, data assimilation, deep learning, earth observation, energy, environmental, forecast, geoscience, geospatial, global, history, imaging, industry, machine learning, machine translation, metadata, meteorological, model, netcdf, opendap, radiation, satellite imagery, solar, statistics, sustainability, time series forecasting, water, weather, zarr
NASA's goal in Earth science is to observe, understand, and model the Earth system to discover how it is changing, to better predict change, and to understand the consequences for life on Earth. The Applied Sciences Program, within the Earth Science Division of the NASA Science Mission Directorate, serves individuals and organizations around the globe by expanding and accelerating societal and economic benefits derived from Earth science, information, and technology research and development.
The Prediction Of Worldwide Energy Resources (POWER) Project, funded through the Applied Sciences Program at NASA Langley Research Center, gathers NASA Earth observation data and parameters related to the fields of surface solar irradiance and meteorology to serve the public in several free, easy-to-access and easy-to-use methods. POWER helps communities become resilient amid observed climate variability by improving data accessibility, aiding research in energy development, building energy efficiency, and supporting agriculture projects.
The POWER project contains over 380 satellite-derived meteorology and solar energy Analysis Ready Data (ARD) parameters at four temporal levels: hourly, daily, monthly, and climatology. The POWER data archive provides data at the native resolution of the source products. The data is updated nightly to maintain near real time availability (2-3 days for meteorological parameters and 5-7 days for solar). The POWER services catalog consists of a series of RESTful Application Programming Interfaces, geospatially enabled image services, and a web mapping Data Access Viewer. These three service offerings support data discovery, access, and distribution to the project’s user base as ARD and as direct application inputs to decision support tools.
The latest data version update includes hourly...
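The availability latencies quoted above (2-3 days for meteorological parameters, 5-7 days for solar) can be turned into a rough date window when planning a request. The following is a minimal sketch only: the function name and the dictionary keys are hypothetical, and the figures are simply the ranges stated in the description, not a guarantee from the POWER service.

```python
from datetime import date, timedelta

# Latency ranges in days (fastest, slowest), copied from the description above.
# The keys "meteorology" and "solar" are labels chosen for this sketch.
LATENCY_DAYS = {"meteorology": (2, 3), "solar": (5, 7)}


def expected_latest_dates(today, kind):
    """Return (optimistic, conservative) latest dates likely to have data."""
    fastest, slowest = LATENCY_DAYS[kind]
    return today - timedelta(days=fastest), today - timedelta(days=slowest)


if __name__ == "__main__":
    for kind in LATENCY_DAYS:
        optimistic, conservative = expected_latest_dates(date.today(), kind)
        print(kind, optimistic.isoformat(), conservative.isoformat())
```

Requesting data only up to the conservative date is the safer choice when a pipeline must not encounter missing values.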
agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth's land surface every 5 days, making the data of great use in ongoing studies. This dataset is the same as the Sentinel-2 dataset, except the JP2K files were converted into Cloud-Optimized GeoTIFFs (COGs). Additionally, SpatioTemporal Asset Catalog metadata is provided in a JSON file alongside the data, and a STAC API called Earth-search is freely available t...
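As an illustration of how a STAC API such as Earth-search is typically queried, the sketch below builds the JSON body for a STAC item search (the `collections`, `bbox`, `datetime`, and `limit` fields come from the STAC API specification). The endpoint URL and the collection ID are assumptions not stated in the description above, and the function name is hypothetical; check the dataset documentation before relying on them.

```python
import json

# Assumed values for illustration; neither is confirmed by the text above.
SEARCH_URL = "https://earth-search.aws.element84.com/v1/search"
COLLECTION = "sentinel-2-l2a"


def build_search_body(bbox, start, end, limit=10):
    """Build a STAC item-search body for a bounding box and a date range.

    bbox is [west, south, east, north] in degrees; start and end are
    YYYY-MM-DD strings that are expanded to a closed UTC interval.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must be [west, south, east, north]")
    return {
        "collections": [COLLECTION],
        "bbox": list(bbox),
        "datetime": f"{start}T00:00:00Z/{end}T23:59:59Z",
        "limit": limit,
    }


if __name__ == "__main__":
    body = build_search_body([36.6, -1.5, 37.1, -1.1], "2023-01-01", "2023-01-31")
    print(json.dumps(body, indent=2))
```

The resulting body would be sent as an HTTP POST to the search endpoint with any HTTP client; the sketch stops short of the network call so that it runs offline.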
bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle
The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad Institute. A human-friendly listing of datasets, instructions for accessing them, and other documentation is at the corresponding GitHub page abou...
agriculture, earth observation, meteorological, natural resource, weather
Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
agriculture, disaster response, earth observation, elevation, geospatial
A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.
cog, earth observation, geophysics, geospatial, global, ice, netcdf, satellite imagery, stac, zarr
The Inter-mission Time Series of Land Ice Velocity and Elevation (ITS_LIVE) project has a singular mission: to accelerate ice sheet and glacier research by producing globally comprehensive, high resolution, low latency, temporally dense, multi-sensor records of land ice and ice shelf change while minimizing barriers between the data and the user. ITS_LIVE data currently consists of NetCDF Level 2 scene-pair ice flow products posted to a standard 120 m grid derived from Landsat 4/5/7/8/9, Sentinel-2 optical scenes, and Sentinel-1 SAR scenes. We have processed all land-ice intersecting image pai...
agriculture, cog, disaster response, earth observation, geospatial, land cover, land use, machine learning, mapping, natural resource, satellite imagery, stac, sustainability, synthetic aperture radar
The European Space Agency (ESA) WorldCover product provides global land cover maps for 2020 & 2021 at 10 m resolution based on Copernicus Sentinel-1 and Sentinel-2 data. The WorldCover product comes with 11 land cover classes and has been generated in the framework of the ESA WorldCover project, part of the 5th Earth Observation Envelope Programme (EOEP-5) of the European Space Agency. A first version of the product (v100), containing the 2020 map was released in October 2021. The 2021 map was released in October 2022 using an improved algorithm (v200). The WorldCover 2020 and 2021 maps we...
bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v4.1 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries, sequenced as part of various disease-specific and population genetic studies. The gnomAD Principal Investigators and team can be found here, and the groups that have contributed data to the current release are listed here. Sign up for the gnom...
broadband, coastal, Continuously Operating Reference Station (CORS), earthquakes, geophysics, geoscience, GNSS, GPS, oceans, RINEX, seismology
GeoNet provides geological hazard information for Aotearoa New Zealand. This dataset contains data and products recorded by the GeoNet sensor network.
GNSS (Global Navigation Satellite System) data include raw data in proprietary and Receiver Independent Exchange Format (RINEX), and local tie-in surveys conducted during equipment changes. More details can be found on the GeoNet geodetic page.
Coastal gauge data include relative measurements of sea level recorded by tsunami monitoring gauges. Raw and quality-controlled data are provided in CREX format (Character Form for the Representation and eXchange of meteorological data). More details can be found on the GeoNet coastal tsunami monitoring gauges page.
Camera image data include webcam images from the GeoNet Volcano monitoring network and Built Environment Instrumentation Programme. More details can be found on the GeoNet camera page.
Waveform data include raw data from weak and strong motion instruments of the GeoNet seismic networks. More details can be found on the GeoNet seismic waveform page.
Seismic data products include strong motion derived data. More details can be found on the GeoNet Strong Motion products page.
Time Series data products include derived time...
agriculture, climate, meteorological, weather
Near Real Time JPSS data is now flowing! See bucket information on the right side of this page to access products!
Satellites in the JPSS constellation gather global measurements of atmospheric, terrestrial and oceanic conditions, including sea and land surface temperatures, vegetation, clouds, rainfall, snow and ice cover, fire locations and smoke plumes, atmospheric temperature, water vapor and ozone. JPSS delivers key observations for the Nation's essential products and services, including forecasting severe weather like hurricanes, tornadoes and blizzards days in advance, and assessin...
computer vision, disaster response, earth observation, geospatial, machine learning, satellite imagery
SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).
bam, bioinformatics, fast5, fasta, fastq, genomic, life sciences, long read sequencing, short read sequencing, transcriptomics
The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...
amazon.science, analytics, deep learning, geospatial, last mile, logistics, machine learning, optimization, routing, transportation, urban
The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in r...
coastal, cog, deafrica, earth observation, geospatial, land cover, natural resource, satellite imagery, stac, sustainability
The Global Mangrove Watch (GMW) dataset is a result of the collaboration between Aberystwyth University (U.K.), solo Earth Observation (soloEO; Japan), Wetlands International, the World Conservation Monitoring Centre (UNEP-WCMC), and the Japan Aerospace Exploration Agency (JAXA). The primary objective of producing this dataset is to provide countries lacking a national mangrove monitoring system with first cut mangrove extent and change maps, to help safeguard against further mangrove forest loss and degradation. The Global Mangrove Watch dataset (version 2) consists of a global baseline map of ...
agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
Digital Earth Africa (DE Africa) provides free and open access to a copy of Landsat Collection 2 Level-2 products over Africa. These products are produced and provided by the United States Geological Survey (USGS). The Landsat series of Earth Observation satellites, jointly led by USGS and NASA, has been continuously acquiring images of the Earth’s land surface since 1972. DE Africa provides data from the Landsat 5, 7 and 8 satellites, including historical observations dating back to the late 1980s and regularly updated new acquisitions. New Level-2 Landsat 7 and Landsat 8 data are available after 15...
biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience
This data set, made available by Janelia's FlyLight project, consists of fluorescence images of Drosophila melanogaster driver lines, aligned to standard templates, and stored in formats suitable for rapid searching in the cloud. Additional data will be added as it is published.
agriculture, cog, disaster response, earth observation, geospatial, global, ice, satellite imagery, synthetic aperture radar
Developed and operated by the Canadian Space Agency, it is Canada's first commercial Earth observation satellite.
agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, water
The Copernicus Global Land Service – Lake Water Quality products offer a comprehensive, satellite-derived monitoring system for assessing key water quality indicators in major large lakes, typically those greater than 50 hectares. These datasets are generated using optical satellite sensors, primarily Sentinel-2 MSI and Sentinel-3 OLCI, with earlier archives derived from Envisat MERIS. Spanning multiple spatial resolutions (100 m and 300 m) and temporal scales (10-day composites), they support both near-real-time and retrospective assessments of inland water quality. Key parameters include surf...
agriculture, climate, cog, deafrica, earth observation, food security, geospatial, meteorological, satellite imagery, stac, sustainability
Digital Earth Africa (DE Africa) provides free and open access to a copy of the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) monthly and daily products over Africa. The CHIRPS rainfall maps are produced and provided by the Climate Hazards Center in collaboration with the US Geological Survey, and use both rain gauge and satellite observations. The CHIRPS-2.0 Africa Monthly dataset is regularly indexed to DE Africa from the CHIRPS monthly data. The CHIRPS-2.0 Africa Daily dataset is likewise indexed from the CHIRPS daily data. Both products have been converted to clou...
climate, coastal, deafrica, earth observation, geospatial, satellite imagery, sustainability
Africa's long and dynamic coastline is subject to a wide range of pressures, including extreme weather and climate, sea level rise and human development. Understanding how the coastline responds to these pressures is crucial to managing this region, from social, environmental and economic perspectives. The Digital Earth Africa Coastlines (provisional) is a continental dataset that includes annual shorelines and rates of coastal change along the entire African coastline from 2000 to the present. The product combines satellite data from the Digital Earth Africa program with tidal modelling t...
agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. The geomedian component combines measurements collected over the specified timeframe to produce one representative, multispectral measurement for every pixel unit of the African continent. The end result is a comprehensive dataset that can be used to generate true-colour images for visual inspection of anthropogenic or natural landmarks. The full spectral dataset can be used to develop m...
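To make the geomedian idea concrete, the sketch below computes a geometric median of several multispectral observations of one pixel using Weiszfeld's iterative algorithm. This is a generic textbook method chosen for illustration; the description above does not say which algorithm the GeoMAD service actually uses, and the function name and sample values are hypothetical.

```python
import math


def geometric_median(points, tol=1e-9, max_iter=1000):
    """Approximate the geometric median of equal-length vectors (Weiszfeld).

    Each point is one observation of a pixel, given as a sequence of band
    values. The result minimizes the sum of Euclidean distances to all points.
    """
    dim = len(points[0])
    # Start from the per-band mean of the observations.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(max_iter):
        # Weight each observation by the inverse of its distance to the
        # current estimate; the floor avoids division by zero.
        weights = [1.0 / max(math.dist(p, est), 1e-12) for p in points]
        total = sum(weights)
        new = [
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)
        ]
        if math.dist(new, est) < tol:
            return new
        est = new
    return est


if __name__ == "__main__":
    # Three similar observations and one bright outlier (e.g. a cloudy scene).
    observations = [(0.10, 0.12), (0.11, 0.12), (0.10, 0.13), (0.90, 0.95)]
    print(geometric_median(observations))
```

Unlike a per-band mean, the geometric median is pulled only weakly by a single outlying observation, which is why it suits cloud-free compositing.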
agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product. Together, they cover all Earth’s land surfaces, large islands, inland and coastal waters every 3-5 days. Sentinel-2 data is tiered by level of pre-processing. Level-0, Level-1A and Level-1B data contain raw data fr...
agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, water
Water Observations from Space (WOfS) is a service that draws on satellite imagery to provide historical surface water observations of the whole African continent. WOfS allows users to understand the location and movement of inland and coastal water present in the African landscape. It shows where water is usually present; where it is seldom observed; and where inundation of the surface has been observed by satellite. WOfS products are generated using the WOfS classification algorithm on Landsat satellite data. There are several WOfS products available for the African continent including scene-level dat...
Homo sapiens, imaging, life sciences, magnetic resonance imaging, neuroimaging, neuroscience
This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG). In addition to the raw data, preprocessed data is also included for some datasets. A complete list of the available datasets can be seen in the documentation link provided below.
aerial imagery, coastal, computer vision, disaster response, earth observation, earthquakes, geospatial, image processing, imaging, infrastructure, land, machine learning, mapping, natural resource, seismology, transportation, urban, water
The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2023. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.
cog, disaster response, earth observation, geospatial, satellite imagery, stac
Pre and post event high-resolution satellite imagery in support of emergency planning, risk assessment, monitoring of staging areas and emergency response, damage assessment, and recovery. These images are generated using the Maxar ARD pipeline, tiled on an organized grid in analysis-ready cloud-optimized formats.
climate, coastal, disaster response, environmental, meteorological, oceans, water, weather
ANNOUNCEMENTS: [NOS OFS Version Updates and Implementation of Upgraded Oceanographic Forecast Modeling Systems for Lakes Superior and Ontario; Effective October 25, 2022](https://www.weather.gov/media/notification/pdf2/scn22-91_nos_loofs_lsofs_v3.pdf)
For decades, mariners in the United States have depended on NOAA's Tide Tables for the best estimate of expected water levels. These tables provide accurate predictions of the astronomical tide (i.e., the change in water level due to the gravitational effects of the moon and sun and the rotation of the Earth); however, they cannot predict water-level changes due to wind, atmospheric pressure, and river flow, which are often significant.
The National Ocean Service (NOS) has the mission and mandate to provide guidance and information to support navigation and coastal needs. To support this mission, NOS has been developing and implementing hydrodynamic model-based Operational Forecast Systems.
This forecast guidance provides oceanographic information that helps mariners safely navigate their local waters. This national network of hydrodynamic models provides users with operational nowcast and forecast guidance (out to 48 – 120 hours) on parameters such as water levels, water temperature, salinity, and currents. These forecast systems are implemented in critical ports, harbors, estuaries, Great Lakes and coastal waters of the United States, and form a national backbone of real-time data, tidal predictions, data management and operational modeling.
Nowcasts and forecasts are scientific predictions about the present and future states of water levels (and possibly currents and other relevant oceanographic variables, such as salinity and temperature) in a coastal area. These predictions rely on either observed data or forecasts from a numerical model. A nowcast incorporates recent (and often near real-time) observed meteorological, oceanographic, and/or river flow rate data. A nowcast covers the period from the recent past (e.g., the past few days) to the present, and it can make predictions for locations where observational data are not available. A forecast incorporates meteorological, oceanographic, and/or river flow rate forecasts and makes predictions for times where observational data will not be available. A forecast is usually initiated by the results of a nowcast.
OFS generally runs four times per day (every 6 hours) on NOAA's Weather and Climate Operational Supercomputing Systems (WCOSS) in a standard Coastal Ocean Modeling Framework (COMF) developed by the Center for Operational Oceanographic Products and Services (CO-OPS). COMF is a set...
agriculturecogdisaster responseearth observationgeospatialimagingsatellite imagerystac
Imagery acquired by the China-Brazil Earth Resources Satellites (CBERS) 4 and 4A. The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF format in order to optimize their use for cloud-based applications. Contains all CBERS-4 MUX, AWFI, PAN5M and PAN10M scenes acquired since the start of the satellite mission and is updated daily with new scenes. CBERS-4A MUX Level 4 (Orthorectified) scenes are being ingested starting from 04-13-2021. CBERS-4A WFI Level 4 (Orthorectified) scenes are being ingested starting from ...
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsynthetic aperture radar
DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications. The Sentinel-1 mission, composed of a constellation of two C-band Synthetic Aperture Radar (SAR) satellites, is operated by the European Space Agency (ESA) as part of the Copernicus Programme. The mission currently collects data every 12 days over Africa at a spatial resolution of approximately 20 m. Radar backscatter measures the amount of microwave radiation reflected back to the sensor from the ground surface. This measurement is sensitive to surface rough...
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product. Together, they cover all Earth’s land surfaces, large islands, inland and coastal waters every 3-5 days. Sentinel-2 data is tiered by level of pre-processing. Level-0, Level-1A and Level-1B data contain raw data fr...
bioinformaticsgenomiclife scienceslong read sequencing
The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Examples include: NA12878 (HG001) and NA24385 (HG002) sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells; and, UHR RNA (direct-RNA) on an ONT PromethION using the latest RNA004 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for ...
climateearth observationenvironmentalnatural resourceoceanssatellite imagerywaterweather
A global, gap-free, gridded, daily 1 km Sea Surface Temperature (SST) dataset created by merging multiple Level-2 satellite SST datasets. Those input datasets include the NASA Advanced Microwave Scanning Radiometer-EOS (AMSR-E), the JAXA Advanced Microwave Scanning Radiometer 2 (AMSR-2) on GCOM-W1, the Moderate Resolution Imaging Spectroradiometers (MODIS) on the NASA Aqua and Terra platforms, the US Navy microwave WindSat radiometer, the Advanced Very High Resolution Radiometer (AVHRR) on several NOAA satellites, and in situ SST observations from the NOAA iQuam project. Data are available fro...
aerial imagerycogearth observationgeospatialsatellite imagerystac
The New Zealand Imagery dataset consists of New Zealand's publicly owned aerial and satellite imagery, which is freely available to use under an open licence. The dataset ranges from the latest high-resolution aerial imagery down to 5cm in some urban areas to lower resolution satellite imagery that provides full coverage of mainland New Zealand, Chathams and other offshore islands. It also includes historical imagery that has been scanned from film, orthorectified (removing distortions) and georeferenced (correctly positioned) to create a unique and crucial record of changes to the New Zea...
astronomyobject detectionplanetarysurvey
Raw data used to discover Near Earth Objects (NEOs) that could potentially impact Earth.
energyenvironmentalgeospatiallidarmodelsolar
Data released under the Department of Energy's (DOE) Open Energy Data Initiative (OEDI). The Open Energy Data Initiative aims to improve and automate access of high-value energy data sets across the U.S. Department of Energy’s programs, offices, and national laboratories. OEDI aims to make data actionable and discoverable by researchers and industry to accelerate analysis and advance innovation.
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsynthetic aperture radar
The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2). The JERS annual mosaic is generated from images acquired by the SAR sensor on the Japanese Earth Resources Satellite-1 (JERS-1) satellite. This product contains radar measurement in L-band and H...
agriculturecogdeafricaearth observationfood securitygeospatialsatellite imagerystacsustainability
Digital Earth Africa's cropland extent map (2019) shows the estimated location of croplands in Africa for the period January to December 2019. Cropland is defined as: "a piece of land of minimum 0.01 ha (a single 10m x 10m pixel) that is sowed/planted and harvest-able at least once within the 12 months after the sowing/planting date." This definition will exclude non-planted grazing lands and perennial crops which can be difficult for satellite imagery to differentiate from natural vegetation. This provisional cropland extent map has a resolution of 10m, and was built using Cope...
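The area figure in the definition above can be checked with a little arithmetic. This is only an illustrative sketch: the function names are hypothetical and not part of any Digital Earth Africa tooling, and it assumes square pixels with a 10 m side, as stated in the description.

```python
SQUARE_METRES_PER_HECTARE = 10_000  # 1 ha = 100 m x 100 m


def pixel_area_ha(side_m):
    """Area in hectares of a square pixel with the given side length in metres."""
    return side_m * side_m / SQUARE_METRES_PER_HECTARE


def cropland_area_ha(pixel_count, side_m=10):
    """Total area in hectares covered by a number of cropland pixels (hypothetical helper)."""
    return pixel_count * pixel_area_ha(side_m)


# One 10 m x 10 m pixel is 100 square metres, i.e. the 0.01 ha minimum in the definition.
print(pixel_area_ha(10))
```

Under the same assumption, a count of cropland pixels in an area of interest converts to hectares by multiplying by 0.01.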
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainability
Fractional cover (FC) describes the landscape in terms of coverage by green vegetation, non-green vegetation (including deciduous trees during autumn, dry grass, etc.) and bare soil. It provides insight into how areas of dry vegetation and/or bare soil and green vegetation are changing over time. The product is derived from Landsat satellite data, using an algorithm developed by the Joint Remote Sensing Research Program. Digital Earth Africa's FC service has two components. Fractional Cover is estimated from each Landsat scene, providing measurements from individual days. Fractional Cover...
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
Digital Earth Africa’s Monthly NDVI Anomaly service provides an estimate of vegetation condition, for each calendar month, against the long-term baseline condition measured for the month from 1984 to 2020 in the NDVI Climatology. A standardised anomaly is calculated by subtracting the long-term mean from an observation of interest and then dividing the result by the long-term standard deviation. Positive NDVI anomaly values indicate vegetation is greener than average conditions, and are usually due to increased rainfall in a region. Negative values indicate additional plant stress relative to t...
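The standardised anomaly described above can be sketched for a single pixel and calendar month as follows. This is a minimal illustration, not the service's implementation: the function name and the sample NDVI values are made up, and the use of the population standard deviation is an assumption, since the description does not say which estimator is used.

```python
from statistics import mean, pstdev


def standardised_anomaly(observation, baseline):
    """Return (observation - long-term mean) / long-term standard deviation."""
    long_term_mean = mean(baseline)
    long_term_std = pstdev(baseline)  # population std: an assumption, see lead-in
    if long_term_std == 0:
        raise ValueError("baseline has zero variance; anomaly is undefined")
    return (observation - long_term_mean) / long_term_std


# Hypothetical NDVI values for one calendar month across three baseline years.
baseline_ndvi = [0.40, 0.50, 0.60]

# Positive result: greener than the baseline; negative result: less green.
print(standardised_anomaly(0.60, baseline_ndvi))
print(standardised_anomaly(0.40, baseline_ndvi))
```

In the actual product the same calculation would be applied per pixel against the 1984 to 2020 climatology for the matching month.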
bathymetryclimatecoastaldisaster responseelevationfloodsforecastgeospatialhydrologic modelhydrologyinfrastructureland coverland usemappingmeteorologicalmodelopen source softwareprecipitationsimulationssustainabilitywaterweather
Geographic (land cover, land elevation, etc.), meteorologic (pluvial, wind, etc.), hydrologic (fluvial, tidal, etc.), hydrodynamic (water surface elevations, flow velocities), and built environment (structures, levees, floodgates, culverts) data used as inputs to and outputs from numerical modeling software for the prediction of flood risk in stochastic and probabilistic frameworks. This data was collected from open sources, such as the National Oceanic and Atmospheric Administration (NOAA) or the United States Geological Survey (USGS). The format of these data is modified to su...
environmentalgeospatialmeteorological
Released to the public as part of the Department of Energy's Open Energy Data Initiative, the Wind Integration National Dataset (WIND) is an update and expansion of the Eastern Wind Integration Data Set and Western Wind Integration Data Set. It supports the next generation of wind integration studies.
array tomographybiologyelectron microscopyimage processinglife scienceslight-sheet microscopymagnetic resonance imagingneuroimagingneuroscience
This bucket contains multiple neuroimaging datasets (as Neuroglancer Precomputed Volumes) across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include segmentations and meshes.
bambioinformaticsbiologycoronavirusCOVID-19fast5fastafastqgeneticgenomichealthjsonlife scienceslong read sequencingmedicineMERSmetadataopen source softwareRDFSARSSARS-CoV-2SPARQL
COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.
earth observationearthquakesseismology
This dataset contains ground motion velocity and acceleration seismic waveforms recorded by the Southern California Seismic Network (SCSN) and archived at the Southern California Earthquake Data Center (SCEDC). A Distributed Acoustic Sensing (DAS) dataset is included.
bioinformaticslife sciencesmetagenomicsopen source softwareproteinprotein folding
The Steinegger Lab Dataset comprises biological databases and resources critical for protein sequence and structure analysis, developed to support ColabFold, MMseqs2, and Foldseek/Foldcomp, three high-performance computational tools widely used in bioinformatics. The MMseqs2 dataset serves as the backbone for our fast structure prediction tool, ColabFold, and includes UniRef30, BFD, and the ColabFold environmental databases. These datasets are specifically designed for the rapid generation of multiple sequence alignments (MSAs), which are essential for high-accuracy structure prediction. Beyond ...
agriculturedisaster responseelevationgeospatiallidarstac
The goal of the USGS 3D Elevation Program (3DEP) is to collect elevation data in the form of light detection and ranging (LiDAR) data over the conterminous United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period. This dataset provides two realizations of the 3DEP point cloud data. The first resource is a public access organization provided in Entwine Point Tiles format, which is a lossless, full-density, streamable octree based on LASzip (LAZ) encoding. The second resource is a Requester Pays bucket of the original, raw LAZ (compressed LAS) 1.4 3DEP format, and more co...
cogdisaster responseearth observationsatellite imagerystac
Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is...
autonomous vehiclescomputer visionlidarroboticstransportationurban
Public large-scale dataset for autonomous driving. It enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car.
cogearth observationelevationgeospatialmappingopen source softwaresatellite imagerystac
ArcticDEM - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2007 to the present. The ArcticDEM project seeks to fill the need for high-resolution time-series elevation data in the Arctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. ArcticDEM data is constructed from in-track and cross-track high...
autonomous vehiclescomputer visionlidarrobotics
This autonomous driving dataset includes data from a 128-beam Velodyne Alpha-Prime lidar, a 5MP Blackfly camera, a 360-degree Navtech radar, and post-processed Applanix POS LV GNSS data. This dataset was collected in various weather conditions (sun, rain, snow) over the course of a year. The intended purpose of this dataset is to enable benchmarking of long-term all-weather odometry and metric localization across various sensor types. In the future, we hope to also support an object detection benchmark.
agricultureatmosphereclimateearth observationenvironmentalmodeloceanssimulationsweather
High-resolution historical and future climate simulations from 1980-2100
cancergeneticgenomicHomo sapienslife sciencesSTRIDEStranscriptomicswhole genome sequencing
The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.
earth observationenergygeospatialmeteorologicalwater
Released to the public as part of the Department of Energy's Open Energy Data Initiative, this is the highest resolution publicly available long-term wave hindcast dataset that – when complete – will cover the entire U.S. Exclusive Economic Zone (EEZ).
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
Digital Earth Africa’s NDVI climatology product represents the long-term average baseline condition of vegetation for every Landsat pixel over the African continent. Both mean and standard deviation NDVI climatologies are available for each calendar month. Some key features of the product are:
life sciencesMus musculusneurophysiologyneuroscienceopen source software
Electrophysiological recordings of mouse brain activity acquired using Neuropixels probes and accompanying behavioral data.
climateCMIP5natural resourcesustainability
A collection of downscaled climate change projections, derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] and across the four greenhouse gas emissions scenarios known as Representative Concentration Pathways (RCPs) [Meinshausen et al. 2011]. The NASA Earth Exchange group maintains the NEX-DCP30 (CMIP5), NEX-GDDP (CMIP5), and LOCA (CMIP5).
bioinformaticsbiologyepigenomicsgeneticgenomiclife sciences
The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The project has generated high-quality, genome-wide maps of several key histone modifications, chromatin accessibility, DNA methylation and mRNA expression across 100s of human cell types and tissues. To see what data is available, please check the directory listing: https://roadmapepigenomics.s3.us-west-2.amazonaws.com/index.html.
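For readers who prefer to browse the bucket programmatically rather than through the directory listing page, the sketch below builds an Amazon S3 ListObjectsV2 request URL for the bucket host named above and parses the XML that such a request returns. It is a hedged illustration only: the function names are hypothetical, and it assumes the bucket permits anonymous listing, which the description does not confirm.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Bucket host taken from the directory-listing URL in the description above.
BASE_URL = "https://roadmapepigenomics.s3.us-west-2.amazonaws.com/"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(prefix="", delimiter="/"):
    """Build a ListObjectsV2 URL for one 'directory' level of the bucket."""
    query = urlencode({"list-type": "2", "prefix": prefix, "delimiter": delimiter})
    return f"{BASE_URL}?{query}"


def parse_listing(xml_text):
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [e.text for e in root.findall(f"{S3_NS}CommonPrefixes/{S3_NS}Prefix")]
    keys = [e.text for e in root.findall(f"{S3_NS}Contents/{S3_NS}Key")]
    return prefixes, keys
```

If anonymous listing is allowed, the XML can be fetched with `urllib.request.urlopen(listing_url())` and passed to `parse_listing`; responses with more than 1,000 entries are paginated, which this sketch does not handle.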
biodiversityearth observationecosystemsenvironmentalgeospatialmappingoceans
Water-column sonar data archived at the NOAA National Centers for Environmental Information.
cogearth observationelevationgeospatialstac
The New Zealand Elevation dataset consists of New Zealand's publicly owned digital elevation models and digital surface models, which are freely available to use under an open licence. The dataset contains 1m resolution grids derived from LiDAR data. Point clouds are not included in the initial release. All of the elevation files are Cloud Optimised GeoTIFFs using LERC compression for the main grid and LERC compression with lower max_z_error for the overviews. These elevation files are accompanied by
Usage examples
earth observationearthquakesseismology
This dataset contains various types of digital data relating to earthquakes in central and northern California. Time series data come from broadband, short period, and strong motion seismic sensors, GPS, and other geophysical sensors.
carbonclimateEEIOscope 3spend-based modelssupply chain
CEDA is a multi-regional Environmentally-Extended Input-Output (EEIO) model developed to support a wide range of environmental systems analyses, including corporate carbon accounting and sustainable spend analysis. CEDA provides unparalleled global coverage and granularity, representing 95% of the world's GDP across 148 countries and 400 sectors, enabling robust and geographically comprehensive Scope 3 greenhouse gas (GHG) measurement. Open CEDA is the publicly available version of CEDA, now easy to download and available for free for all use cases. For more information please visit our w...
cogearth observationenvironmentalgeospatiallabeledmachine learningsatellite imagerystac
Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image ...
cogearth observationelevationgeospatialmappingopen source softwaresatellite imagerystac
The Reference Elevation Model of Antarctica - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2009 to the present. The REMA project seeks to fill the need for high-resolution time-series elevation data in the Antarctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. REMA data is constructed from in...
bioinformaticsbiologyenvironmentalepigenomicsgeneticgenomiclife sciences
The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.
bioinformaticsfastqgene expressiontranscriptomics
High-throughput transcriptomics (HTTr) data generated by US EPA Office of Research and Development, Center for Computational Toxicology and Exposure (CCTE), Biomolecular and Computational Toxicology Division. All data is generated using TempO-Seq targeted RNA-seq technology from in vitro cell culture systems.
cogearth observationgeospatialminingnatural resourcesatellite imagerysustainability
The Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Level 1 Precision Terrain Corrected Registered At-Sensor Radiance (AST_L1T) data contains calibrated at-sensor radiance, which corresponds with the ASTER Level 1B (AST_L1B), that has been geometrically corrected, and rotated to a north-up UTM projection. The AST_L1T is created from a single resampling of the corresponding ASTER L1A (AST_L1A) product. The precision terrain correction process incorporates GLS2000 digital elevation data with derived ground control points (GCPs) to achieve topographic accuracy for all daytim...
cancergeneticgenomiclife sciencesvcf
Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...
cancergenomiclife sciencesSTRIDEStranscriptomics
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is the Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.
agricultureatmosphereclimateearth observationenvironmentalmodeloceanssimulationsweather
The sixth phase of global coupled ocean-atmosphere general circulation model ensemble.
cogearth observationgeosciencegeospatialimage processingopen source softwaresatellite imagerystac
Earth observation (EO) data cubes produced from analysis-ready data (ARD) of CBERS-4, Sentinel-2 A/B and Landsat-8 satellite images for Brazil. The datacubes are regular in time and use a hierarchical tiling system. Further details are described in Ferreira et al. (2020).
air qualityatmospherechemistryclimateenvironmentalmeteorologicalmodelweather
Input data for the GEOS-Chem Chemical Transport Model, including NASA/GMAO MERRA-2 and GEOS-FP meteorological products, chemistry input data, emissions input data, and other smaller datasets such as model initial conditions.
air qualityatmospherechemistryclimateenvironmentalmeteorologicalmodelweather
Input data for nested-grid simulations using the GEOS-Chem Chemical Transport Model. This includes the NASA/GMAO MERRA-2 and GEOS-FP meteorological products, the HEMCO emission inventories, and other small data such as model initial conditions.
disaster responseevents
This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.
bamcancergeneticgenomiclife sciencesvcf
The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.
fastageneticgenomiclife sciencesmetagenomicsSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing
This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve near...
chemistrycloud computingdata assimilationdigital assetsdigital preservationenergyenvironmentalfree softwaregenomeHPCinformation retrievalinfrastructurejsonmachine learningmaterials sciencemolecular dynamicsmoleculeopen source softwarephysicspost-processingx-ray crystallography
Materials Project is an open database of computed materials properties aiming to accelerate materials science research. The resources in this OpenData dataset contain the raw, parsed, and build data products.
agricultureclimatedisaster responseenvironmentaltransportationweather
The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model. Additionally, note that no streamflow or other data assimilation is performed within any of the NWM retrospective simulations.
One application of this dataset is to provide historical context to current near real-time streamflow, soil moisture and snowpack conditions. The retrospective data can be used to infer flow frequencies and perform temporal analyses with hourly streamflow output and 3-hourly land surface output. This dataset can also be used in the development of end user applications which require a long baseline of data for system training or verification purposes.
...
air qualitycitiesenvironmentalgeospatial
Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines.
aerial imagerycogdisaster responseearth observationsatellite imagery
OpenAerialMap is a collection of high-resolution openly licensed satellite and aerial imagery.
citiescoastalcogelevationenvironmentallidarurban
This dataset is Lidar data that has been collected by the Scottish public sector and made available under the Open Government Licence. The data are available as point cloud (LAS format or in LAZ compressed format), along with the derived Digital Terrain Model (DTM) and Digital Surface Model (DSM) products as Cloud optimized GeoTIFFs (COG) or standard GeoTIFF. The dataset contains multiple subsets of data which were each commissioned and flown in response to different organisational requirements. The details of each can be found at https://remotesensingdata.gov.scot/data#/list
autonomous vehicleslidarroboticstransportationurban
nuPlan is the world's first large-scale planning benchmark for autonomous driving.
cogearth observationenvironmentalgeospatialland coverland usemachine learningmappingplanetarysatellite imagerystacsustainability
This dataset, produced by Impact Observatory, Microsoft, and Esri, displays a global map of land use and land cover (LULC) derived from ESA Sentinel-2 imagery at 10 meter resolution for the years 2017 - 2023. Each map is a composite of LULC predictions for 9 classes throughout the year in order to generate a representative snapshot of each year. This dataset was generated by Impact Observatory, which used billions of human-labeled pixels (curated by the National Geographic Society) to train a deep learning model for land classification. Each global map was produced by applying this model to ...
autonomous vehiclescomputer visiongeospatiallidarrobotics
Home of the Argoverse datasets. Public datasets supported by detailed maps to test, experiment, and teach self-driving vehicles how to understand the world around them. This bucket includes the following datasets:
calcium imagingelectron microscopyimaginglife scienceslight-sheet microscopymagnetic resonance imagingneuroimagingneurosciencevolumetric imagingx-rayx-ray microtomographyx-ray tomography
This data ecosystem, Brain Observatory Storage Service & Database (BossDB), contains several neuro-imaging datasets across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include dense segmentation and meshes.
cogcomputer visionearth observationgeospatialimage processingsatellite imagerystacsynthetic aperture radar
Open Synthetic Aperture Radar (SAR) data from Capella Space. Capella Space is an information services company that provides on-demand, industry-leading, high-resolution synthetic aperture radar (SAR) Earth observation imagery. Through a constellation of small satellites, Capella provides easy access to frequent, timely, and flexible information affecting dozens of industries worldwide. Capella's high-resolution SAR satellites are matched with unparalleled infrastructure to deliver reliable global insights that sharpen our understanding of the changing world – improving decisions ...
cancergenomiclife sciencesSTRIDEStranscriptomics
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is the Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.
coastalfloods
The Virginia Department of Conservation and Recreation’s Office of Resilience Planning maintains this public file repository to provide access to flood resilience open data products. The repository is designed to house public data produced for the Virginia Coastal Resilience Master Plan (CRMP), Virginia Flood Protection Master Plan (VFPMP), and other purposes. At present, the repository hosts only data products produced for the CRMP Phase II (2025) and Phase I (2021).
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacwater
The Digital Earth Africa continental Waterbodies Monitoring Service identifies more than 700,000 water bodies from over three decades of satellite observations. This service maps persistent and seasonal water bodies and the change in their water surface area over time. Mapped water bodies may include, but are not limited to, lakes, ponds, man-made reservoirs, wetlands, and segments of some river systems. On a local, regional, and continental scale, this service helps improve our understanding of surface water dynamics and water availability and can be used for monitoring water bodies such as we...
cogearth observationelevationgeospatialmappingopen source softwaresatellite imagerystac
EarthDEM - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2002 to the present. The EarthDEM project seeks to fill the need for high-resolution time-series elevation data in non-polar regions. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. EarthDEM data is constructed from in-track and cross-track ...
electricityenergyenvironmentalgeospatialsupply chainsustainabilitytransportation
GeoJSON files for the MIT Climate & Sustainability Consortium's Geospatial Trucking Industry Decarbonization Explorer
driftersEulerianHYCOMLagrangiannumerical particleocean circulationocean currentsocean sea surface heightocean simulationocean velocityoceans
A combined dataset of simulated ocean sea surface height, near-surface velocities, and particle trajectories from a global 1/25th degree HYbrid Coordinate Ocean Model (HYCOM) 1-year run.
globaloceans
Global Ocean Forecasting System (GOFS) 3.1 output on the GLBv0.08 grid. The resolution is 0.08° between 40°S and 40°N and 0.04° poleward of these latitudes. The temporal frequency is 3-hourly. This data was created by the Naval Research Laboratory: Ocean Dynamics and Prediction Branch.
life sciencesMus musculusneurophysiologyneuroscienceopen source software
Behavioral data of mice performing a decision-making task, associated with the 2020 publication of the IBL.
life sciencesMus musculusneurophysiologyneuroscienceopen source software
Electrophysiological recordings acquired using Neuropixels probes in different mice and labs, targeting the same brain locations (including posterior parietal cortex, hippocampus, and thalamus).
agricultureclimatemeteorologicalweather
The Rapid Refresh Forecast System (RRFS) is the National Oceanic and Atmospheric Administration’s (NOAA) next generation convection-allowing, rapidly-updated ensemble prediction system, currently scheduled for operational implementation in 2026. The operational configuration will feature a 3 km grid covering North America and include deterministic forecasts every hour out to 18 hours, with deterministic and ensemble forecasts to 60 hours four times per day at 00, 06, 12, and 18 UTC. The RRFS will provide guidance to support forecast interests including, but not limited to, aviation, severe convective weather, renewable energy, heavy precipitation, and winter weather on timescales where rapidly-updated guidance is particularly useful.
The RRFS is underpinned by the Unified Forecast System (UFS), a community-based Earth modeling initiative, and benefits from collaborative development efforts across NOAA, academia, and research institutions.
This bucket provides access to real-time, experimental RRFS prototype output and will provide access to final retrospective output once completed.
rrfs_a/rrfs_a.20241201/12/control contains the deterministic forecast initialized at 12 UTC on 01 December 2024. Users will find two types of output in GRIB2 format. The first is: rrfs.t12z.natlev.f018.grib2
rrfs.t12z.prslev.f018.conus.grib2
rrfs.t00z.prslev.f002.grib2
Alaska: rrfs.t00z.prslev.f002.ak.grib2
Hawaii: rrfs.t00z.prslev.f002.hi.grib2
Puerto Rico: rrfs.t00z.prslev.f002.pr.grib2
rrfs_a/rrfs_a.20231214/00/mem0001 contains the forecast from member 1, and rrfs_a/rrfs_a.20231214/00/enspost_timelag
...
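The example paths in this entry suggest a regular naming pattern: a directory keyed by initialization date, cycle hour, and member, followed by a file name keyed by cycle, level type, forecast hour, and an optional domain suffix. The following is a minimal sketch of that pattern, inferred only from the examples shown in this listing. The function name `rrfs_key` and its parameters are hypothetical, the sketch assumes files sit directly under the member directory, and it deliberately leaves out the bucket name, which is not given here.

```python
from datetime import datetime


def rrfs_key(init, forecast_hour, level="prslev", domain=None, member="control"):
    """Build an object key following the pattern seen in the listing's examples.

    Hypothetical helper: the layout is inferred from the sample paths only,
    and it assumes files sit directly under the member directory.
    """
    cycle = f"{init:%H}"  # two-digit initialization hour, e.g. "12"
    directory = f"rrfs_a/rrfs_a.{init:%Y%m%d}/{cycle}/{member}"
    # Regional files in the examples carry a suffix such as "conus", "ak", "hi", "pr".
    suffix = f".{domain}" if domain else ""
    filename = f"rrfs.t{cycle}z.{level}.f{forecast_hour:03d}{suffix}.grib2"
    return f"{directory}/{filename}"


# Deterministic run initialized at 12 UTC on 01 December 2024, forecast hour 18.
native_key = rrfs_key(datetime(2024, 12, 1, 12), 18, level="natlev")
conus_key = rrfs_key(datetime(2024, 12, 1, 12), 18, domain="conus")
member_key = rrfs_key(datetime(2023, 12, 14, 0), 2, domain="ak", member="mem0001")
print(native_key)
print(conus_key)
print(member_key)
```

Check any key built this way against the dataset's own documentation before relying on it, since the listing above is truncated and may omit other file types.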
biologyhealthimage processingimaginglife sciencesmagnetic resonance imagingneurobiologyneuroimaging
This dataset contains deidentified raw k-space data and DICOM image files of over 1,500 knees and 6,970 brains.
citiestransportationurban
Data of trips taken by taxis and for-hire vehicles in New York City. Note: access to this dataset is free; however, direct S3 access does require an AWS account. Anonymous downloads are accessible from the dataset's documentation webpage listed below.
bioinformaticsbiologygeneticgenomiclife sciencesreference index
This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easily incorporable with a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...
disaster responseearth observationearthquakes
Grillo has developed an IoT-based earthquake early-warning system, with sensors currently deployed in Mexico, Chile, Puerto Rico and Costa Rica, and is now opening its entire archive of unprocessed accelerometer data to the world to encourage the development of new algorithms capable of rapidly detecting and characterizing earthquakes in real time.
acousticsbiodiversitybiologyclimatecoastaldeep learningecosystemsenvironmentalmachine learningmarine mammalsoceansopen source software
This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications in ocean soundscape research, education, and the arts.
geospatialgeothermalimage processingseismology
Released to the public as part of the Department of Energy's Open Energy Data Initiative, these data represent vertical and horizontal distributed acoustic sensing (DAS) data collected as part of the Poroelastic Tomography (PoroTomo) project funded in part by the Office of Energy Efficiency and Renewable Energy (EERE), U.S. Department of Energy.
computer visiondeep learningearth observationgeospatiallabeledmachine learningsatellite imagery
RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsynthetic aperture radar
Synthetic Aperture Radar (SAR) sensors have the advantage of operating at wavelengths not impeded by cloud cover and can acquire data over a site during the day or night. The Sentinel-1 mission, part of the Copernicus joint initiative by the European Commission (EC) and the European Space Agency (ESA), provides reliable and repeated wide-area monitoring using its SAR instrument. Sentinel-1 Monthly Mosaics are an analysis-ready product of individual Sentinel-1 acquisitions. Sentinel-1 monthly mosaics are generated from Radiometric Terrain Corrected (RTC) backscatter data, with variations from changi...
bamCOVID-19geneticgenomiclife sciencesMERSSARSSARS-CoV-2virus
Serratus is a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses in response to the COVID-19 pandemic through re-analysis of publicly available genomic data. Our resulting vertebrate viral alignment data is explorable via the Serratus Explorer and directly accessible on Amazon S3.
machine learningNASA SMD AI
The v1 dataset includes AIA/HMI observations 2010-2018 and v2 includes AIA/HMI observations 2010-2020 in all 10 wavebands (94A, 131A, 171A, 193A, 211A, 304A, 335A, 1600A, 1700A, 4500A), with 512x512 resolution and 6-minute cadence; HMI vector magnetic field observations in Bx, By, and Bz components, with 512x512 resolution and 12-minute cadence; and EVE observations in 39 wavelengths from 2010-05-01 to 2014-05-26, with 10-second cadence.
cloud computingdatacenterenergyHPCworkload analysis
Collection of parsed datacenter logs and time series data of hardware utilization from the MIT Supercloud system.
censusdifferential privacydisclosure avoidanceethnicitygroup quartershispanichousinghousing unitslatinonoisy measurementspopulationraceredistrictingvoting age
The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File (2023-06-30) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022] https://doi.org/10.1162/99608f92.529e3cb9, and implemented in https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code). The NMF was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File. The NMF consists of the full set of privacy-protected statistical queries (counts of individuals or housing units with particular combinations of characteristics) of confidential 2010 Census data relating to the 2010 Demonstration Data Products Suite – Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File – Production Settings (2023-04-03). These statistical queries, called “noisy measurements,” were produced under the zero-Concentrated Differential Privacy framework (Bun, M. and Steinke, T [2016] https://arxiv.org/abs/1605.02065; see also Dwork C. and Roth, A. [2014] https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) implemented via the discrete Gaussian mechanism (Canonne C., et al., [2023] https://arxiv.org/abs/2004.00010), which added positive or negative integer-valued noise to each of the resulting counts. The noisy measurements are an intermediate stage of the TDA prior to the post-processing the TDA then performs to ensure internal and hierarchical consistency within the resulting tables. The Census Bureau has released these 2010 Census demonstration data to enable data users to evaluate the expected impact of disclosure avoidance variability on 2020 Census data.
The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File (2023-04-03) has been cleared for public dissemination by the Census Bureau Disclosure Review Board (CBDRB-FY22-DSEP-004).
The 2010 Census Production Settings Demographic and Housing Characteristics Demonstration Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2010 Census Edited File (CEF), which includes confidential data initially collected in the 2010 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) (https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/04-Demonstration_Data_Products_Suite/2023-04-03/). As these 2010 Census demonstration data are intended to support study of the design and expected impacts of the 2020 Disclosure Avoidance System, the 2010 CEF records were pre-processed before application of the zCDP framework. This pre-processing converted the 2010 CEF records into the input-file format, response codes, and tabulation categories used for the 2020 Census, which differ in substantive ways from the format, response codes, and tabulation categories originally used for the 2010 Census.
The NMF provides estimates ...
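To make the idea of "integer-valued noise added to counts" concrete, here is a small illustrative sketch of adding discrete Gaussian noise to a list of counts. It is not the Census Bureau's production code: it uses a simple truncated, floating-point approximation of the distribution rather than the exact sampler described in the cited paper, and the function names, the `tail` cutoff, the `sigma` value, and the sample counts are all assumptions made for illustration. The `zcdp_rho` helper applies the standard relation from the discrete Gaussian literature, rho = sensitivity^2 / (2 * sigma^2).

```python
import math
import random


def discrete_gaussian_pmf(sigma, tail=10):
    """Approximate discrete Gaussian on the integers, truncated at +/- tail*sigma.

    Illustrative only: a production sampler would be exact, not float-based.
    """
    bound = max(1, math.ceil(tail * sigma))
    support = list(range(-bound, bound + 1))
    # Unnormalized weight of integer k is exp(-k^2 / (2 sigma^2)).
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in support]
    total = sum(weights)
    return support, [w / total for w in weights]


def add_noise(counts, sigma, rng):
    """Return counts with independent integer noise added to each entry."""
    support, probs = discrete_gaussian_pmf(sigma)
    noise = rng.choices(support, weights=probs, k=len(counts))
    return [c + n for c, n in zip(counts, noise)]


def zcdp_rho(sigma, sensitivity=1.0):
    """zCDP parameter for discrete Gaussian noise on a query of given sensitivity."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
true_counts = [120, 0, 37, 5]     # hypothetical counts, not real Census data
noisy_counts = add_noise(true_counts, sigma=2.0, rng=rng)
print(noisy_counts, zcdp_rho(2.0))
```

Because the noise can be negative, a noisy count for a small cell may fall below zero; the descriptions above indicate that consistency is only enforced later, in the TDA post-processing step.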
censusdifferential privacydisclosure avoidanceethnicitygroup quartershousinghousing unitsnoisy measurementspopulationraceredistrictingvoting age
The 2020 Census Demographic and Housing Characteristics Noisy Measurement File is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022], and implemented in primitives.py). The 2020 Census Demographic and Housing Characteristics Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism (Canonne C., et al., [2023]), which added positive or negative integer-valued noise to each of the resulting counts. These are estimated counts of individuals and housing units included in the 2020 Census Edited File (CEF), which includes confidential data collected in the 2020 Census of Population and Housing.
The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the Census Demographic and Housing Characteristics Summary File. In addition to the noisy measurements, constraints based on invariant calculations (counts computed without noise) are also included, with the exception of the state-level total populations, which can be sourced separately from data.census.gov.
The Noisy Measurement File was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File.
The noisy measurements are p...
agriculturefood securitygeneticgenomiclife sciences
The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.
agriculturecogdisaster responseearth observationgeospatialimagingsatellite imagerystacsustainability
Imagery acquired by the Amazonia-1 satellite. The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF format in order to optimize their use for cloud-based applications. WFI Level 4 (Orthorectified) scenes are being ingested daily starting from 08-29-2022; the complete Level 4 archive will be ingested by the end of October 2022.
chemical biologychemistryclimatedatacenterdigital assetsgeochemistrygeophysicsgeosciencemarinenetcdfoceans
Argo is an international program to observe the interior of the ocean with a fleet of profiling floats drifting in the deep ocean currents (https://argo.ucsd.edu). Argo GDAC is a dataset of 5 billion in situ ocean observations from 18,000 profiling floats (4,000 active), which started 20 years ago. The Argo GDAC dataset is a collection of 18,000 NetCDF files. It is a major asset for ocean and climate science and a contributor to IOCCP reports.
biologycell biologycomputer visionelectron microscopyimaginglife sciencesmicroscopysegmentation
The Automated Segmentation of intracellular substructures in Electron Microscopy (ASEM) project provides deep learning models trained to segment structures in 3D images of cells acquired by Focused Ion Beam Scanning Electron Microscopy (FIB-SEM). Each model is trained to detect a single type of structure (mitochondria, endoplasmic reticulum, Golgi apparatus, nuclear pores, clathrin-coated pits) in cells prepared via chemical fixation (CF) or high-pressure freezing and freeze substitution (HPFS). You can use our open source pipeline to load a model and predict a class of sub-cellular structur...
atmosphereclimateclimate modeldata assimilationforecastgeosciencegeospatiallandmeteorologicalweatherzarr
This is a cloud-hosted subset of the CAM6+DART (Community Atmosphere Model version 6 Data Assimilation Research Testbed) Reanalysis dataset. These data products are designed to facilitate a broad variety of research using the NCAR CESM 2.1 (National Center for Atmospheric Research's Community Earth System Model version 2.1), including model evaluation, ensemble hindcasting, data assimilation experiments, and sensitivity studies. They come from an 80 member ensemble reanalysis of the global troposphere and stratosphere using DART and CAM6. The data products represent states of the atmospher...
cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences
This dataset contains all the data for the CAncer MEtastases in LYmph nOdes challeNge or CAMELYON. CAMELYON was the first challenge using whole-slide images in computational pathology and aimed to help pathologists identify breast cancer metastases in sentinel lymph nodes. Lymph node metastases are extremely important to find, as they indicate that the cancer is no longer localized and systemic treatment might be warranted. Searching for these metastases in H&E-stained tissue is difficult and time-consuming and AI algorithms can play a role in helping make this faster and more accura...
climateclimate modelclimate projectionsCMIP6ocean circulationocean currentsocean sea surface heightocean simulationocean velocity
This dataset provides several global fields describing the state of the atmosphere, ocean, land and ice from a high-resolution (0.1° for the ocean/ice models and 0.25° for the land/atmosphere models) numerical earth system model, the Community Earth System Model (CESM, https://www.cesm.ucar.edu/). Texas A&M University (TAMU) and National Center for Atmospheric Research together with international partners collaboratively carried out a large set of high-resolution climate simulations, including a 500-year long preindustrial control simulation (PI-CTRL) described here. The CESM uses dynamic equation...
air qualityclimateenvironmentalgeospatialmeteorological
CMAS Data Warehouse on AWS collects and disseminates meteorology, emissions and air quality model input and output for Community Multiscale Air Quality (CMAQ) Model Applications. This dataset is available as part of the AWS Open Data Program; therefore, egress fees are not charged to either the host or the person downloading the data. This S3 bucket is maintained as a public service by the University of North Carolina's CMAS Center, the US EPA’s Office of Research and Development, and the US EPA’s Office of Air and Radiation. Metadata and DOIs for datasets included in the CMAS Data Wareho...
bambioinformaticsbiologyCaenorhabditis elegansfastqgatk-svgenetic mapsgenomegenome wide association studygenomiclife sciencesshort read sequencingvariant annotationvcf
The Caenorhabditis Natural Diversity Resource (CaeNDR) is a data repository and analysis hub of wild strains of selfing Caenorhabditis species C. elegans, C. briggsae, and C. tropicalis from around the world to facilitate discovery of genetic variation across all three species through genome-wide association mappings to correlate genotype with phenotype and identify genetic variation underlying quantitative traits.
cancergeneticgenomiclife sciencesSTRIDESwhole genome sequencing
The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...
atmosphereclimateclimate modelgeospatialicelandmodeloceanssustainabilityzarr
The Community Earth System Model (CESM) Large Ensemble Numerical Simulation (LENS) dataset includes a 40-member ensemble of climate simulations for the period 1920-2100 using historical data (1920-2005) or assuming the RCP8.5 greenhouse gas concentration scenario (2006-2100), as well as longer control runs based on pre-industrial conditions. The data comprise both surface (2D) and volumetric (3D) variables in the atmosphere, ocean, land, and ice domains. The total data volume of the original dataset is ~500TB, which has traditionally been stored as ~150,000 individual CF/NetCDF files on disk o...
disaster responsegeospatialmappingosm
Daylight is a complete distribution of global, open map data that’s freely available with support from community and professional mapmakers. Meta combines the work of global contributors to projects like OpenStreetMap with quality and consistency checks from Daylight mapping partners to create a free, stable, and easy-to-use street-scale global map.
The Daylight Map Distribution contains a validated subset of the OpenStreetMap database. In addition to the standard OpenStreetMap PBF format, Daylight is available in two parquet formats that are optimized for AWS Athena including geometries (Points, LineStrings, Polygons, or MultiPolygons). First, Daylight OSM Features contains the nearly 1B renderable OSM features. Second, Daylight OSM Elements contains all of OSM, including all 7B nodes without attributes, and relations that do not contain geometries, such as turn restrictions.
Daylight ...
agriculturecogdisaster responseearth observationgeospatialland coverland usemachine learningmappingnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar
The WorldCover 10m Annual Composites were produced, as part of the European Space Agency (ESA) WorldCover project, from the yearly Copernicus Sentinel-1 and Sentinel-2 archives for both years 2020 and 2021. These global mosaics consist of four product composites. A Sentinel-2 RGBNIR yearly median composite for bands B02, B03, B04, B08. A Sentinel-2 SWIR yearly median composite for bands B11 and B12. A Sentinel-2 NDVI yearly percentiles composite (NDVI 90th, NDVI 50th, NDVI 10th percentiles). A Sentinel-1 GAMMA0 yearly median composite for bands VV, VH and VH/VV (power scaled). Each product is...
citiesclimateenergyenergy modelinggeospatialmetadatamodelopen source softwaresustainabilityutilities
The U.S. Department of Energy (DOE) funded a three-year project, End-Use Load Profiles for the U.S. Building Stock, that culminated in this publicly available dataset of calibrated and validated 15-minute resolution load profiles for all major residential and commercial building types and end uses, across all climate regions in the United States. These EULPs were created by calibrating the ResStock and ComStock physics-based building stock models using many different measured datasets, as described here. This dataset includes load profiles for both the baseline building stock and the building ...
agriculturecogdisaster responseelevationgeospatialhydrologysatellite imagerystac
Height Above Nearest Drainage (HAND) is a terrain model that normalizes topography to the relative heights along the drainage network and is used to describe the relative soil gravitational potentials or the local drainage potentials. Each pixel value represents the vertical distance to the nearest drainage. The HAND data provides near-worldwide land coverage at 30 meters and was produced from the 2021 release of the Copernicus GLO-30 Public DEM as distributed in the Registry of Open Data on AWS.
agriculturecogearth observationearthquakesecosystemsenvironmentalgeologygeophysicsgeospatialglobalinfrastructuremappingnatural resourcesatellite imagerysynthetic aperture radarurban
This data set is the first-of-its-kind spatial representation of multi-seasonal, global SAR repeat-pass interferometric coherence and backscatter signatures. Global coverage comprises all land masses and ice sheets from 82 degrees northern to 79 degrees southern latitude. The data set is derived from high-resolution multi-temporal repeat-pass interferometric processing of about 205,000 Sentinel-1 Single-Look-Complex data acquired in Interferometric Wide-Swath mode (Sentinel-1 IW mode) from 1-Dec-2019 to 30-Nov-2020. The data set was developed by Earth Big Data LLC and Gamma Remote Sensing AG, under contract for NASA's Jet Propulsion Laboratory. ...
agriculturecogdeep learninglabeledland covermachine learningsatellite imagery
High resolution, annual cropland and landcover maps for selected African countries developed by Clark University's Agricultural Impacts Research Group using various machine learning approaches applied to Planet imagery, including field boundary and cultivated frequency maps, as well as multi-class land cover.
agriculturedisaster responseearth observationgeospatialmeteorologicalsatellite imageryweather
Himawari-9, stationed at 140.7E, owned and operated by the Japan Meteorological Agency (JMA), is a geostationary meteorological satellite, with Himawari-8 as on-orbit back-up, that provides constant and uniform coverage of east Asia, and the west and central Pacific regions from around 35,800 km above the equator with an orbit corresponding to the period of the earth’s rotation. This allows JMA weather offices to perform uninterrupted observation of environmental phenomena such as typhoons, volcanoes, and general weather systems. Archive data back to July 2015 is available for Full Disk (AHI-L...
aerial imageryearth observationelevationgeospatiallidar
The KyFromAbove initiative is focused on building and maintaining a current basemap for Kentucky that can meet the needs of its users at the state, federal, local, and regional level. A common basemap, including current color leaf-off aerial photography and elevation data (LiDAR), reduces the cost of developing GIS applications, promotes data sharing, and adds efficiencies to many business processes. All basemap data acquired through this effort is being made available in the public domain. KyFromAbove acquires aerial imagery and LiDAR during leaf-off conditions in the Commonwealth. The imagery...
cancerclassificationcomputational pathologycomputer visiondeep learningdigital pathologygrand-challenge.orghistopathologyimaginglife sciencesmachine learningmedical image computingmedical imaging
This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...
elevationlidarplanetarystac
The Lunar Orbiter Laser Altimeter (LOLA) has collected and released almost 7 billion individual laser altimeter returns from the lunar surface. This dataset includes individual altimetry returns scraped from the Planetary Data System (PDS) LOLA Reduced Data Record (RDR) Query Tool, V2.0. Data are organized in 15° x 15° (longitude/latitude) sections, compressed and encoded into the Cloud Optimized Point Cloud (COPC) file format, and collected into a Spatio-Temporal Asset Catalog (STAC) collection for query and analysis. The data are in latitude, longitude, and radius (X, Y, Z) format with the p...
bamcramfastqgeneticgenomiclife sciencesSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing
The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-rel...
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
The National Air Quality Forecasting Capability (NAQFC) dataset contains model-generated air quality (AQ) forecast guidance from three different prediction systems. The first system is a coupled weather and atmospheric chemistry numerical forecast model, known as the Air Quality Model (AQM). It is used to produce forecast guidance for ozone (O3) and particulate matter that is less than or equal to 2.5 micrometers in diameter (PM2.5). Prior to May 14, 2024, AQM predictions were derived using the EPA’s Community Multiscale Air Quality (CMAQ) model, driven by meteorological fields from NCEP’s operational weather forecast models, specifically the North American Mesoscale Model (NAM; prior to 20 July 2021) and the Global Forecast System (GFS; beginning 20 July 2021). Since May 14, 2024, AQM guidance has been produced by a unique application within the community-based Unified Forecast System (UFS). The core model components in this application are derived directly from the fully online-coupled UFS-based weather and CMAQ-based chemistry models. In addition, it incorporates information related to chemical and particle source emissions as it integrates forward in time, including anthropogenic chemical emissions provided by the EPA, fire emissions from NOAA/NESDIS, and airborne particles generated by human activities and those predicted to be generated by wind-driven erosion and biosphere at ground level. The NCEP NAQFC AQM output fields in this archive include model raw and bias-corrected predictions dating back to 1 January 2020, all generated by the contemporaneous operational AQM, beginning with AQMv5 in 2020, transitioning to AQMv6 on 20 July 2021, and to AQMv7 on 14 May 2024. The length of each forecast was 48 hours prior to the implementation of AQMv6, and has been 72 hours ever since. The history of AQM upgrades is documented here
The second prediction system is known as the Hybrid Single-Particle Lagrangian Integrated Trajectory model (HYSPLIT). It is a widely used atmospheric transport and dispersion model containing an internal dust-generation module. It provides forecast guidance for atmospheric dust concentration and, prior to 28 June 2022, it also provided the NAQFC forecast guidance for smoke. Starting on that date, the third prediction system, a regional numerical weather prediction (NWP) model known as the Rapid Refresh (RAP) model, subsumed HYSPLIT for operational smoke guidance, simulating the emission, transport, and deposition of smoke particles that originate from biomass burning (fires) and anthropogenic sources.
The output from each of these modeling systems is generated over three separate domains, one covering CONUS, another over Alaska, and the other over Hawaii. Currently, for this archive, the O3, PM2.5, and smoke output is available over all three domains, while dust products are available only over the CONUS domain. The predicted concentrations of all species in the lowest model layer (i.e., the layer in contact with the surface) are available, as are vertically integrated values of smoke and dust. The data is gridded horizontally within each domain, with a grid spacing of approximately 5 km over CONUS, 6 km over Alaska, and 2.5 km over Hawaii. O3 concentrations are provided in parts per billion (PPB), while the concentrations of all other species are quantified in units of micrograms per cubic meter (ug/m3), except for the column-integrated smoke values which are expressed in units of milligrams per square meter (mg/m2).
Temporally, O3 and PM2.5 are available as maximum and/or averaged values over various time periods, selected in part for consistency with the EPA’s National Ambient Air Quality Standards. Specifically, O3 is available in both 1-hour and 8-hour (backward calculated) averages, as well as preceding 1-hour and 8-hour maximum values. Similarly, PM2.5 is available in 1-hour and 24-hour average values and 24-hour maximum values. In addition, all O3 and PM2.5 fields are available with bias-corrected magnitudes, based on derived historical model biases relative to observations.
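As a rough illustration of what a backward-calculated 8-hour average and its maximum look like, here is a minimal Python sketch. The function name, the sample values, and the convention that a window ends at the hour it is labelled with are all assumptions made for illustration; they are not taken from the NAQFC documentation.

```python
from typing import List, Optional


def backward_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Mean of each value and the (window - 1) values preceding it.

    Positions without a full window of history are returned as None.
    """
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough preceding hours yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical hourly O3 values in ppb (made-up numbers, not from the archive).
hourly_o3 = [30.0, 32.0, 35.0, 40.0, 46.0, 50.0, 52.0, 49.0, 45.0, 41.0]

avg_8h = backward_mean(hourly_o3, window=8)

# Maximum of the available 8-hour backward averages.
max_8h_avg = max(v for v in avg_8h if v is not None)
```

The same helper with `window=24` would give a 24-hour backward average of the kind described for PM2.5, again under the assumed windowing convention.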
The AQM produ...
coastalcogearth observationelevationgeospatiallidarstac
The New Zealand Coastal Elevation dataset consists of New Zealand's publicly owned coastal digital elevation models, which are freely available to use under an open licence. The data consists of bare earth (DEM) data that traverses the coastal zone, including the seabed down to approximately 25 m in depth. Data is provided as nationally consistent 1 m resolution tiles derived from LiDAR surveys. All of the coastal elevation files are Cloud Optimised GeoTIFFs using LERC compression for the main grid and LERC compression with lower max_z_error for the overviews. These elevation files are accomp...
earth observationgeospatialsatellite imageryurban
NDUI combines a cloud shadow-free Landsat Normalized Difference Vegetation Index (NDVI) composite with DMSP/OLS Night Time Light (NTL) data to characterize global urban areas at a 30 m resolution. It greatly enhances urban areas, which can then be easily distinguished from bare lands, including fallows and deserts. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial detail within urban areas, the NDUI has potential for urbanization studies at regional and global scales.
agricultureclimateearth observationmeteorologicalweather
Open-Meteo integrates weather models from reputable national weather services, offering a swift and efficient weather API. Real-time weather forecasts are unified into a time-series database that provides historical and future weather data for any location worldwide. Through Open-Meteo on AWS Open Data, you can download the Open-Meteo weather database and analyze weather data locally. Docker images are provided to download data and to expose an HTTP API endpoint. Using Open-Meteo SDKs, you can seamlessly integrate weather data into your Python, TypeScript, Swift, Kotlin, or Java applications. T...
disaster responsegeospatialmappingosm
OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3 in both standard formats (OSM PBF, XML) and cloud-native formats optimized for analytics workloads.
geospatialglobalmappingosmparquettransportation
Overture is a collaboratively built, global, open map data project for developers who build map services or use geospatial data. Overture Open Map Data contains data that are standardized under the themes of Admins, Base, Buildings, Places, and Transportation. Overture also includes a Global Entity Reference System (GERS) which encodes map data to a shared universal reference. Beginning with the Overture 2023-11-14-alpha.0 release, the data is available as cloud-native GeoParquet files.
air qualityatmosphereearth observationenvironmentalgeospatialsatellite imagery
NO2 tropospheric column density, screened for CloudFraction < 30%, provided as a global daily composite at 0.25 degree resolution for the temporal range of 2004 to May 2020. Original archive data in HDF5 has been processed into a Cloud-Optimized GeoTiff (COG) format. Quality Assurance - This data has been validated by the NASA Science Team at Goddard Space Flight Center. Cautionary Note: https://airquality.gsfc.nasa.gov/caution-interpretation.
citieselevationgeospatiallandlidarmappingurban
The objective of the Mapa 3D Digital da Cidade (M3DC) of the São Paulo City Hall is to publish LiDAR point cloud data. The initial data was acquired in 2017 by aerial surveying and future data will be added. This publicly accessible dataset is provided in the Entwine Point Tiles format as a lossless octree, full density, based on LASzip (LAZ) encoding.
autonomous racingautonomous vehiclescomputer visionGNSSimage processinglidarlocalizationobject detectionobject trackingperceptionradarrobotics
The RACECAR dataset is the first open dataset for full-scale and high-speed autonomous racing. Multi-modal sensor data has been collected from fully autonomous Indy race cars operating at speeds of up to 170 mph (273 kph). Six teams that raced in the Indy Autonomous Challenge during 2021-22 have contributed to this dataset. The dataset spans 11 interesting racing scenarios across two race tracks, which include solo laps, multi-agent laps, overtaking situations, high accelerations, banked tracks, obstacle avoidance, and pit entry and exit at different speeds. The data is organized and released in bot...
bioinformaticselectrophysiologylife sciencesmicroscopyneurophysiologyneuroscience
The SPARC Datasets comprise a collection of scientific data that is focused on bridging the body and the brain. The datasets focus on neural connectivity, organ innervation and detailed anatomical mapping of the peripheral nervous system. SPARC datasets distinguish themselves from other data resources through their multi-modal approach to scientific data, integrating molecular, imaging, timeseries and other datatypes associated with the interaction between the peripheral nervous system and organs. SPARC data provides a unique integrated effort to develop next generation mapping of anatomical ...
air qualityenvironmental
SPARTAN (Surface PARTiculate mAtter Network) measures and provides surface ambient particulate matter (PM2.5 and PM10) concentration and the chemical composition around the world, with the purpose of connecting ground-based PM2.5 and satellite remote sensing.
archivescitiescomputer visionconservationcultural preservationculturedemographicsdigital assetsgeospatialhistoryhousingland usemappingurban
The dataset contains metadata records for 50,600 maps from the Sanborn Fire Insurance Maps collection and their corresponding 440,048 JPEG images. The Sanborn collection at Library of Congress includes over fifty thousand editions of fire insurance maps comprising almost seven hundred thousand individual sheets. The Library of Congress holdings represent the largest extant collection of maps produced by the Sanborn Map Company.
biodiversityecosystemsfisheriesmarine
The project presents Sea Around Us Global Fisheries Catch Data aggregated at EEZ level. The data are computed from reconstructed catches from various official fisheries statistics and scientific, technical, and policy reports about the fisheries, and include estimates of discards and of unreported and illegal catch from all maritime countries and major territories of the world. This project was the result of a collaboration between Sea Around Us and the CIC programme, a collaborative programme between the University of British Columbia (UBC) and AWS.
climateenvironmentalmeteorologicaloceansoceanssustainabilityweather
This dataset includes archival hourly data from the Sofar Spotter buoy global network (https://weather.sofarocean.com/) from 2019 to March 2022.
climateenvironmentalGPSweather
SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open-source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself. Currently 313 receiver stations are providing data for an average of 384 radiosondes a day. The data within this repository contains received telemetry frames, including radiosonde type, GPS position, a...
biologyimaginglife sciencesneurobiologyneuroimagingneuroscience
The Human Connectome Project (HCP Young Adult, HCP-YA) is mapping the healthy human connectome by collecting and freely distributing neuroimaging and behavioral data on 1,200 normal young adults, aged 22-35.
aerial imageryearth observationelevationgeospatialland coverlidar
The State of Vermont has partnered with Amazon's Open Data Initiative to make a wide range of geospatial data available in the public domain. Vermont acquires aerial imagery and LiDAR during leaf-off conditions. The imagery typically ranges from 30-centimeter to 15-centimeter resolution and is available from Vermont's Amazon S3 bucket in Cloud Optimized GeoTiff (COG) format. LiDAR data has been acquired and is available as USGS Quality Level-1 (QL1) and Level-2 (QL2) compliant datasets in COG format. Geospatial datasets derived from imagery and/or lidar are also available as COGs, ...
agriculturecoglabeledland covermachine learningsatellite imagery
Crop field boundaries digitized in Planet imagery collected across Africa between 2017 and 2023, developed by Farmerline, Spatial Collective, and the Agricultural Impacts Research Group at Clark University, with support from the Lacuna Fund (Estes et al., 2024; ...
bioinformaticsbiologygeneticgenomichealthlife sciencesproteinreference indextranscriptomics
A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).
atmosphereclimatedeep learningenvironmentalexplorationgeophysicsgeosciencegeospatialglobaliceplanetarysatellite imageryzarr
The Chalmers Cloud Ice Climatology (CCIC) is a novel, deep-learning-based climate record of ice-particle concentrations in the atmosphere. CCIC results are available at high spatial and temporal resolution (0.07° / 3 h from 1983, 0.036° / 30 min from 2000) and are thus ideally suited for evaluating high-resolution weather and climate models or studying individual weather systems.
atmosphereclimateclimate modelgeospatialicelandmodeloceanssustainabilityzarr
The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021. NCAR has copied a subset (currently ~500 TB) of CESM2 LENS data to Amazon S3 as part of the AWS Public Datasets Program. To optimize for large-scale analytics we have represented ...
data assimilationelectricityenergyenergy modelingindustrialmeteorologicalsolartransportation
Projects that use the dsgrid toolkit assemble bottom-up descriptions of electricity demand and related data that are highly resolved geographically, temporally, and sectorally. Typically modelers describe multiple scenarios of future energy use at hourly resolution, suitable for inclusion in long-term power system planning models, i.e., capacity expansion and production cost models.
air temperatureatmospheremeteorologicalnear-surface air temperaturenear-surface relative humiditynear-surface specific humidityprecipitationweather
These products are a subset of the ECMWF real-time forecast data and are made available to the public free of charge. They are based on the medium-range (high-resolution and ensemble) and seasonal forecast models. Products are available at 0.4 degrees resolution in GRIB2 format unless stated otherwise.
atmosphereclimateearth observationglobalsignal processingweather
This is an updating archive of radio occultation (RO) data using the transmitters of the Global Navigation Satellite Systems (GNSS) as generated and processed by the COSMIC DAAC (ucar), the Jet Propulsion Laboratory (jpl) of the California Institute of Technology, and the Radio Occultation Meteorology Satellite Application Facility (romsaf). The contributions for ucar and romsaf are currently active.
This dataset is funded by the NASA Earth Science Data Systems and the Advancing Collaborative Connections for Earth System Science (ACCESS) 2019 program.
bioinformaticsbiologygeneticgenomiclife sciences
The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a ...
geopackagehydrographyhydrologic modelhydrologysimulationszarr
GEOGLOWS is the Group on Earth Observation's Global Water Sustainability Program. It coordinates efforts from public
and private entities to make application-ready river data more accessible and sustainably available to underdeveloped
regions. The GEOGLOWS Hydrological Model provides a retrospective and daily forecast of global river discharge at 7
million river sub-basins. The stream network is a hydrologically conditioned subset of the TDX-Hydro streams and
basins data produced by the United States' National Geospatial-Intelligence Agency. The daily forecast provides
3-hourly average discharge in a 51-member ensemble with a 15-day lead time, derived from the ECMWF Integrated Forecast
System (IFS). The retrospective simulation is derived from ERA5 climate reanalysis data and provides daily average
streamflow beginning on 1 January 1940. New forecasts are uploaded daily and the retrospective simulation is updated
weekly on Sundays to keep the lag time between 5 and 12 days.
The geoglows-v2 bucket contains: (1) model configuration files used to generate the simulations, (2) the GIS streams
datasets used by the model, (3) the GIS streams datasets optimized for visualizations used by Esri's Living Atlas
layer, (4) several supporting tables of metadata, including country names, river names, and hydrological properties
used for modeling.
The geoglows-v2-forecasts bucket contains: (1) daily 15-day forecasts in zarr format optimized for time series queries
of all ensemble members in the prediction, (2) CSV formatted summary files optimized for producing time series animated
web maps for the entire global streams dataset.
The geoglows-v2-retrospective bucket contains: (1) the model retrospective outputs in (1a) zarr format optimized for
time series queries of up to a few hundred rivers on demand as well as (1b) in netCDF format best for bulk download...
geneticgenomiclife sciencesreference indexvcf
Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up to date GIAB release.
aerial imageryagricultureclimatecogearth observationgeospatialimage processingland covermachine learningsatellite imagery
Global and regional Canopy Height Maps (CHM). Created using machine learning models on high-resolution worldwide Maxar satellite imagery.
agricultureearth observationmeteorologicalnatural resourceweather
Historical and one-day delay data from the IDEAM radar network.
cogelevationplanetarystac
The Japan Aerospace EXploration Agency (JAXA) SELenological and ENgineering Explorer (SELENE) mission’s Kaguya spacecraft was launched on September 14, 2007 and science operations around the Moon started October 20, 2007. The primary mission in a circular polar orbit 100 km above the surface lasted from October 20, 2007 until October 31, 2008. An extended mission was then conducted in lower orbits (averaging 50 km above the surface) from November 1, 2008 until the SELENE mission ended with Kaguya impacting the Moon on June 10, 2009. These data are digital terrain models derived using the NASA A...
cogplanetarysatellite imagerystac
The Japan Aerospace EXploration Agency (JAXA) SELenological and ENgineering Explorer (SELENE) mission’s Kaguya spacecraft was launched on September 14, 2007 and science operations around the Moon started October 20, 2007. The primary mission in a circular polar orbit 100 km above the surface lasted from October 20, 2007 until October 31, 2008. An extended mission was then conducted in lower orbits (averaging 50 km above the surface) from November 1, 2008 until the SELENE mission ended with Kaguya impacting the Moon on June 10, 2009. These data were collected in monoscopic observing mode. To cre...
air temperatureatmosphereforecastgeosciencegeospatialglobalmeteorologicalmodelnear-surface air temperaturenear-surface relative humiditynetcdfweather
The flagship Numerical Weather Prediction (NWP) model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. The global ensemble (MOGREPS-G) produces forecasts for the whole globe up to a week ahead. The projection used is the Equirectangular Latitude-Longitude and the grid resolution is 20 km. The forecast is updated regularly with a time delay and provided in NetCDF format. Please ...
air temperatureatmosphereforecastgeosciencegeospatialglobalmeteorologicalmodelnear-surface air temperaturenear-surface relative humiditynetcdfweather
The flagship Numerical Weather Prediction (NWP) model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. The regional ensemble (MOGREPS-UK) produces forecasts for an area covering the UK for the next five days. In the UK ensemble the model parameters (temperature, pressure, wind, humidity, etc.) are forecast at grid points separated by about 2.2 km, and the model has 70 ver...
cancergenomiclife sciencesSTRIDESwhole genome sequencing
The Molecular Profiling to Predict Response to Treatment (MP2PRT) program is part of the NCI's Cancer Moonshot Initiative. The aim of this program is the retrospective characterization and analysis of biospecimens collected from completed NCI-sponsored trials of the National Clinical Trials Network and the NCI Community Oncology Research Program. This study, titled "Identification of Genetic Changes Associated with Relapse and/or Adaptive Resistance in Patients Registered as Favorable Histology Wilms Tumor on AREN03B2", performs genomic characterization (WGS 30X, Total RNAseq, mi...
biologyfluorescence imagingimage processingimaginglife sciencesmicroscopyneurobiologyneuroimagingneuroscience
This data set, made available by Janelia's MouseLight project, consists of images and neuron annotations of the Mus musculus brain, stored in formats suitable for viewing and annotation using the HortaCloud cloud-based annotation system.
atmosphereclimateclimate modelgeospatiallandmodelsustainabilityzarr
The NA-CORDEX dataset contains regional climate change scenario data and guidance for North America, for use in impacts, decision-making, and climate science. The NA-CORDEX data archive contains output from regional climate models (RCMs) run over a domain covering most of North America using boundary conditions from global climate model (GCM) simulations in the CMIP5 archive. These simulations run from 1950–2100 with a spatial resolution of 0.22°/25km or 0.44°/50km. This AWS S3 version of the data includes selected variables converted to Zarr format from the original NetCDF. Only daily data a...
aerial imageryagriculturecogearth observationgeospatialnatural resourceregulatory
The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This "leaf-on" imagery typically ranges from 30 centimeters to 100 centimeters in resolution and is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed Raw GeoTiff format, and from the naip-visualization Amazon S3 bucket as 3-band (RGB) imagery in Cloud Optimized GeoTiff format. More details on NAIP
cogplanetarysatellite imagerystac
Knowledge of a planetary surface’s topography is necessary to understand its geology and enable landed mission operations. The Solid State Imager (SSI) on board NASA’s Galileo spacecraft acquired more than 700 images of Jupiter’s moon Europa. Although moderate- and high-resolution coverage is extremely limited, repeat coverage of a small number of sites enables the creation of digital terrain models (DTMs) via stereophotogrammetry. Here we provide stereo-derived DTMs of five sites on Europa. The sites are the bright band Agenor Linea, the crater Cilix, the crater Pwyll, pits and chaos adjacent...
cogplanetarysatellite imagerystac
These data are infrared image mosaics, tiled to the Mars quadrangle, generated using Thermal Emission Imaging System (THEMIS) images from the 2001 Mars Odyssey orbiter mission. The mosaic is generated at the full resolution of the THEMIS infrared dataset, which is approximately 100 meters/pixel. The mosaic was absolutely photogrammetrically controlled to an improved Viking MDIM network that was developed by the USGS Astrogeology processing group using the Integrated Software for Imagers and Spectrometers. Image-to-image alignment precision is subpixel (i.e., <100 m). These 8-bit, qualitative d...
cogplanetarysatellite imagerystac
The Solid State Imager (SSI) on NASA's Galileo spacecraft acquired more than 500 images of Jupiter's moon, Europa. These images vary from relatively low-resolution hemispherical imaging, to high-resolution targeted images that cover a small portion of the surface. Here we provide a set of 92 image mosaics generated from minimally processed, projected Galileo images with photogrammetrically improved locations on Europa's surface.
These images provide users with nearly the entire Galileo Europa imaging dataset at its native resolution and with improved relative image locations. The S...
cogplanetarysatellite imagerystac
The Solid State Imager (SSI) on NASA's Galileo spacecraft acquired more than 500 images of Jupiter's moon, Europa. These images vary from relatively low-resolution hemispherical imaging, to high-resolution targeted images that cover a small portion of the surface. Here we provide a set of 481 minimally processed, projected Galileo images with photogrammetrically improved locations on Europa's surface. These individual images were subsequently used as input into a set of 92 observation mosaics.
These images provide users with nearly the entire Galileo Europa imaging dataset at its native resolution and with improved relative image locations. The Solid State Imager on NASA's Galileo spacecraft provided the only moderate- to high-resolution images of Jupiter's moon, Europa. Unfortunately, uncertainty in the position and pointing of the spacecraft, as well as the position and orientation of Europa, when the images were acquired resulted in significant errors in image locations on the surface. The result of these errors is that images acquired during different Galileo orbits, or even at different times during the same orbit, are significantly misaligned (errors of up to 100 km on the surface).
The dataset provides a set of individual images that can be used for scientific analysis...
cogelevationplanetarysatellite imagerystac
As of March 2023, the Mars Reconnaissance Orbiter (MRO) High Resolution Imaging Science Experiment (HiRISE) sensor has collected more than 5000 targeted stereopairs. During HiRISE acquisition, the Context Camera (CTX) also collects lower resolution, higher spatial extent context images. These CTX acquisitions are also targeted stereopairs. This data set contains targeted CTX DTMs and orthoimages, created using the NASA Ames Stereo Pipeline. These data have been created using relatively controlled CTX images that have been globally bundle adjusted using the USGS Integrated System for Imagers and Spectro...
cogplanetarysatellite imagerystac
These data are digital terrain models (DTMs) created by multiple different institutions and released to the Planetary Data System (PDS) by the University of Arizona. The data are processed from the Planetary Data System (PDS) stored JP2 files, map projected, and converted to Cloud Optimized GeoTiffs (COGs) for efficient remote data access. These data are controlled to the Mars Orbiter Laser Altimeter (MOLA). Therefore, they are a proxy for the geodetic coordinate reference frame. These data are not guaranteed to co-register with uncontrolled products (e.g., the uncontrolled High Resolution ...
cogplanetarysatellite imagerystac
These data are red and color Reduced Data Record (RDR) observations collected and originally processed by the High Resolution Imaging Science Experiment (HiRISE) team. The data are processed from the Planetary Data System (PDS) stored RDRs, map projected, and converted to Cloud Optimized GeoTiffs (COGs) for efficient remote data access. These data are not photogrammetrically controlled and use a priori NAIF SPICE pointing. Therefore, these data will not co-register with controlled data products. Data are released using simple cylindrical (planetocentric positive East, center longitude 0, -180...
agricultureclimatemeteorologicalweather
UPDATE TO GHCN PREFIXES - The NODD team is working on improving performance and access to the GHCNd data and will be implementing an updated prefix structure. For more information on the prefix changes, please see the "READ ME on the NODD Github". If you have questions, comments, or feedback, please reach out to nodd@noaa.gov with GHCN in the subject line.
Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements ...
agricultureclimatedisaster responseenvironmentalweather
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
The HRRR ZARR formatted data was originally generated by the University of Utah under a grant provided by NOAA. They are continuing to publish ZARR versions of HRRR data. For information about data in s3://hrrrzarr/, please contact ...
earth observationenergygeospatialmeteorologicalsolar
Released to the public as part of the Department of Energy's Open Energy Data Initiative, the National Solar Radiation Database (NSRDB) is a serially complete collection of hourly and half-hourly values of the three most common measurements of solar radiation (global horizontal, direct normal, and diffuse horizontal irradiance) and meteorological data. These data have been collected at a sufficient number of locations and temporal and spatial scales to accurately represent regional solar radiation climates.
graphjsonmetadatascholarly communication
An open, comprehensive index of scholarly papers, citations, authors, institutions, and journals.
biologycell biologycell imagingcomputer visionfluorescence imagingimaginglife sciencesmachine learningmicroscopy
The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.
agricultureanalysis ready dataceosdisaster responseearth observationgeospatialsatellite imagerystacsustainabilitysynthetic aperture radar
The RADARSAT Constellation Mission (RCM) is Canada's third generation of Earth observation satellites. Launched on June 12, 2019, the three identical satellites work together to bring solutions to key challenges for Canadians. As part of ongoing Open Government efforts, NRCan produces CEOS analysis ready data (ARD) of Canada's landmass using a 30 m Compact-Polarization standard coverage, every 12 days. RCM CEOS-ARD (POL) is the first ever polarimetric dataset approved by the CEOS committee. Previously, users had to order, download, and process RCM images (level 1) on their own, often with expensive software. This new dataset aims to remove these burdens with a new STAC catalog for discovery and direct download links.
bioinformaticsbiologygeneticgenomicinfrastructurelife sciencessingle-cell transcriptomicstranscriptomicswhole genome sequencing
Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.
agricultureclimateearth observationenvironmentalmeteorologicalmodelsustainabilitywaterweather
SILO is a database of Australian climate data from 1889 to the present. It provides continuous, daily time-step data products in ready-to-use formats for research and operational applications. SIL...
climateearth observationenvironmentalgeospatialglobaloceans
Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to the recent past. Suitable for large-scale oceanographic, meteorological, and climatological applications, such as evaluating or constraining environmental models or case studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.
agriculturecogdisaster responseearth observationgeospatialsatellite imagerysynthetic aperture radar
Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited for sea and land monitoring, emergency response to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from the beginning to the present, converted to cloud-optimized GeoTIFF format.
agriculturecogearth observationgeospatialmachine learningnatural resourcesatellite imagery
The Sentinel-2 L2A 120m mosaic is a derived product that contains best pixel values for 10-daily periods, modelled by removing the cloudy pixels and then interpolating among the remaining values. Because some parts of the world have lengthy cloudy periods, clouds may remain in some areas. The actual modelling script is available here.
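To make the "remove cloudy pixels, then interpolate" idea concrete, here is a minimal Python sketch for a single pixel's time series. It is a generic illustration and not the modelling script referred to above; the function name, the sample reflectance values, and the choice of linear interpolation are assumptions.

```python
from typing import List, Optional


def fill_gaps_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate None entries between the nearest valid neighbours.

    Leading and trailing gaps are left as None (no extrapolation).
    """
    out = list(series)
    valid = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(valid, valid[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            out[i] = series[left] + weight * (series[right] - series[left])
    return out


# Hypothetical reflectance values for one pixel over six acquisitions,
# with a made-up cloud mask (True means the observation is cloudy).
reflectance = [0.10, 0.80, 0.12, 0.90, 0.85, 0.18]
cloudy = [False, True, False, True, True, False]

# Step 1: drop cloudy observations. Step 2: interpolate across the gaps.
masked = [None if c else r for r, c in zip(reflectance, cloudy)]
filled = fill_gaps_linear(masked)
```

If a pixel is cloudy for an entire period and beyond, there are no valid neighbours to interpolate between, which is one way residual gaps or clouds can survive in a product of this kind.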
cogearth observationenvironmentalgeospatiallandoceanssatellite imagerystac
This data set consists of observations from the Sentinel-3 satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-3 is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the Ocean and Land Colour Instrument (OLCI) for medium resolution marine and terrestrial optical measurements, the Sea and Land Surface Temperature Radiometer (SLSTR), the SAR Radar Altimeter (SRAL), the MicroWave Radiometer (MWR) and the Precise Orbit Determination (POD) instruments. The satellite was launched in 2016 and entered routine operational phase in 201...
air qualityatmospherecogearth observationenvironmentalgeospatialsatellite imagerystac
This data set consists of observations from the Sentinel-5 Precursor (Sentinel-5P) satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-5P is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the TROPOspheric Monitoring Instrument (TROPOMI) which is a spectrometer that senses ultraviolet (UV), visible (VIS), near (NIR) and short wave infrared (SWIR) to monitor ozone, methane, formaldehyde, aerosol, carbon monoxide, nitrogen dioxide and sulphur dioxide in the atmosphere. The satellite was launched in October 2017 and entered ro...
biodiversitybiologyecosystemsimage processingmultimediawildlife
The goal of SiPeCaM is to create a data source for evaluating changes in the state of biodiversity, considering key aspects of how the ecosystem behaves.
analyticsbroadbandcitiescivicdisaster responsegeospatialglobalgovernment spendinginfrastructureinternetmappingnetwork trafficparquetregulatorytelecommunicationstiles
Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile format and Apache Parquet, with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
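To make the tiling concrete, the sketch below maps a longitude/latitude pair to a zoom level 16 tile index using the widely used spherical Web Mercator ("slippy map") convention. It is an illustration only: the function names are hypothetical, and it assumes the dataset's tiles follow that standard convention, which the description above does not spell out. Under that assumption the equatorial tile width comes out near 611.5 meters, close to the approximate figure quoted above.

```python
import math

EARTH_RADIUS_M = 6378137.0   # equatorial radius assumed by spherical Web Mercator
MAX_LAT_DEG = 85.05112878    # latitude limit of the square Web Mercator map


def lonlat_to_tile(lon_deg, lat_deg, zoom=16):
    """Return the (x, y) tile index containing a point (hypothetical helper)."""
    n = 2 ** zoom
    # Clamp latitude to the range the projection can represent.
    lat_rad = math.radians(max(-MAX_LAT_DEG, min(MAX_LAT_DEG, lat_deg)))
    x = math.floor((lon_deg + 180.0) / 360.0 * n)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Keep indices inside the grid (e.g. longitude exactly 180 degrees).
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_width_at_equator_m(zoom=16):
    """Width in meters of one tile at the equator for the given zoom level."""
    return 2.0 * math.pi * EARTH_RADIUS_M / (2 ** zoom)


if __name__ == "__main__":
    print(lonlat_to_tile(0.0, 0.0))
    print(round(tile_width_at_equator_m(), 1))
```

A tile index computed this way could be used to group your own point measurements before comparing them with the per-tile averages in the dataset.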
meteorologicalsatellite imageryweather
Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.
bioinformaticshealthlife sciencesnatural language processingus
The Synthea generated data is provided here as 1,000-person (1k), 100,000-person (100k), and 2,800,000-person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and gov...
geneticgenome wide association studygenomiclife sciencespopulation genetics
Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.
geneticgenome wide association studygenomiclife sciencespopulation genetics
A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.
robotics
This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state-of-the-art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 ...
agricultureanalyticsbiodiversityconservationdeep learningfood securitygeospatialmachine learningsatellite imagery
iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 and 20-50cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.
disaster responsegeospatialmappingosm
The real-changesets dataset is an augmented representation of OpenStreetMap changesets in JSON format. It contains the current and the previous version of each feature in a changeset. It is primarily used by OSMCha, the main OpenStreetMap validation tool, to visualize a changeset and help the user understand what was changed on the map. The real-changesets are created by combining the changeset metadata and the augmented diff generated by overpass.
agriculturelidarlocalizationmappingrobotics
The AG-LOAM dataset has been released to facilitate the evaluation of LiDAR-based odometry algorithms in agricultural environments.
cogdisaster responsegeospatialsatellite imagerystac
Synthetic Aperture Radar (SAR) data is a powerful tool for monitoring and assessing disaster events and can provide valuable insights for researchers, scientists, and emergency response teams. The Alaska Satellite Facility (ASF) curates this collection of (primarily) SAR and SAR-derived satellite data products from a variety of data sources for disaster events.
bioinformaticslife sciencesmedicinepharmaceuticalstructural biology
AdaptiveFlow Versions of Ligand Libraries in Ready-To-Dock Format
biologycancercomputer visiongene expressiongeneticglioblastomaHomo sapiensimage processingimaginglife sciencesmachine learningneurobiology
This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.
biologygene expressiongeneticimage processingimaginglife sciencesMus musculusneurobiologytranscriptomics
The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across >20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an...
cancergeneticgenomicHomo sapienslife sciencesSTRIDES
Beat AML 1.0 is a collaborative research program involving 11 academic medical centers that worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples, offering genomic, clinical, and drug response data. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...
climateenvironmentalsatellite imagery
A dataset of satellite retrievals of atmospheric methane that extends from 30 April 2018 to present.
cancercomputational pathologycomputer visiondeep learninghistopathologylife sciences
This page describes the COBRA (Classification Of Basal cell carcinoma, Risky skin cancers and Abnormalities) skin pathology dataset, which comprises over 7000 histopathology whole-slide-images related to the diagnosis of basal cell carcinoma skin cancer, the most commonly diagnosed cancer. The dataset includes biopsies and excisions and is divided into four groups. The first group contains about 2,500 BCC biopsies with subtype labels, while the second group includes 2,500 non-BCC biopsies with different types of skin dysplasia. The third group has 1,000 labelled risky cancer biopsies, includin...
coronavirusCOVID-19life sciences
A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis.
biodiversitybioinformaticsbiologybiomolecular modelingbrain imagescell biologycell imagingcziimaginglife sciencesmachine learningmicroscopymodelproteinzarr
This dataset contains a diverse range of biological imaging data and models. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.
biodiversitybiologybiomolecular modelingcell biologyczihdf5life sciencesmachine learningmodelproteintranscriptomics
This dataset contains transcriptomic biological data and models. The models embed transcriptomic data and facilitate transcriptomic analysis. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.
elevationfloodsgeospatiallandlidarurban
The LiDAR Point Clouds is a product that is part of the CanElevation Series created to support the National Elevation Data Strategy implemented by NRCan.
This product contains point clouds from various airborne LiDAR acquisition projects conducted in Canada. These airborne LiDAR acquisition projects may have been conducted by NRCan or by various partners. The LiDAR point cloud data is licensed under an open government license and has been incorporated into the National Elevation Data Strategy.
Point cloud files are distributed by LiDAR acquisition project without integration between projects.
The point cloud files are distributed using the compressed .LAZ / Cloud Optimized Point Cloud (COPC) format. The COPC open format is an octree reorganization of the data inside a .LAZ 1.4 file. It allows efficient use and visualization rendering via HTTP calls (e.g. via the web), while offering the capabilities specific to the compressed .LAZ format which is already well established in the industry. Point cloud files are therefore both downloadable for local use and viewable via URL links from a cloud computing environment.
The reference system used for all point clouds in the product is NAD83(CSRS), epoch 2010. The projection used is the UTM projection with the corresponding zone. Elevations are orthometric and expressed in reference to the Canadian Geodetic Vertical Datum of 2013 (CGVD2013).
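Because each point cloud uses the UTM zone that matches its location, it can help to know which zone a site falls in before combining files. The sketch below applies the standard 6-degree UTM zone rule to a longitude; the function name is hypothetical, and the product's own metadata remains the authoritative source for the zone of any given file.

```python
import math


def utm_zone_from_longitude(lon_deg):
    """Return the UTM zone number (1-60) for a longitude in degrees (hypothetical helper).

    Uses the standard rule of 6-degree-wide zones starting at 180 degrees west.
    """
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be between -180 and 180 degrees")
    zone = math.floor((lon_deg + 180.0) / 6.0) + 1
    # Longitude of exactly 180 degrees would give 61; fold it into zone 60.
    return min(zone, 60)


if __name__ == "__main__":
    print(utm_zone_from_longitude(-75.7))   # a longitude near Ottawa
    print(utm_zone_from_longitude(-123.1))  # a longitude near Vancouver
```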
cogconservationdeep learningearth observationenvironmentalgeospatialimage processingland coverlidarsatellite imagery
Mean canopy tree height for the Amazon forest over the period 2020-2024 at 4.78 m spatial resolution. Created using a deep learning model on high-resolution Planet imagery from Norway's International Climate and Forest Initiative (NICFI) Satellite Data Program. From the original research paper: https://doi.org/10.48550/arXiv.2501.10600
cell biologycomputer visionelectron microscopyimaginglife sciencesorganelle
High resolution images of subcellular structures.
agriculturecomputer visionIMUlidarlocalizationmappingrobotics
CitrusFarm is a multimodal agricultural robotics dataset that provides both multispectral images and navigational sensor data for localization, mapping and crop monitoring tasks.
cancergenomiclife sciencesSTRIDEStranscriptomicswhole genome sequencing
The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.
energygeothermal
Data released from projects funded by the Department of Energy's Geothermal Technologies Office (DOE GTO) that are too large or complex to be conveniently accessed by traditional means. The GDR data lake aims to improve and automate access of high-value geothermal data sets, making data actionable and discoverable by researchers and industry to accelerate analysis and advance innovation. This data lake is a sister-data lake to the Department of Energy’s Open Energy Data Initiative (OEDI) Data Lake.
biologycell imagingelectrophysiologyinfrastructurelife sciencesneuroimagingneurophysiologyneuroscience
DANDI is a public archive of neurophysiology datasets, including raw and processed data, and associated software containers. Datasets are shared under Creative Commons CC0 or CC-BY licenses. The data archive provides a broad range of cellular neurophysiology data. This includes electrode and optical recordings, and associated imaging data using a set of community standards: NWB:N - NWB:Neurophysiology, BIDS - Brain Imaging Data Structure, and ...
atmosphereclimateclimate modelgeospatialmodelzarr
The EURO-CORDEX dataset contains regional climate model data for Europe, for use in impacts, decision-making, and climate science. Currently, the bucket contains monthly datasets of 2m air temperature downscaled from CMIP5 global model datasets using different regional climate models.
cancerepigenomicsgenomiclife sciencesSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing
The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 mon...
agricultureearth observationmeteorologicalweather
The up-to-date weather radar from the FMI radar network is available as Open Data. The data contain both single radar data and composites over Finland in GeoTIFF and HDF5 formats. Available composite parameters consist of radar reflectivity (DBZ), rainfall intensity (RR), and precipitation accumulation of 1, 12, and 24 hours. Single radar parameters consist of radar reflectivity (DBZ), radial velocity (VRAD), rain classification (HCLASS), and cloud top height (ETOP 20). Raw volume data from single radars are also provided in HDF5 format with ODIM 2.3 conventions. Radar data becomes avail...
cancergenomiclife sciences
The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.
genomegenotypinggolden retriever lifetime studylife sciencesmorris animal foundation
Morris Animal Foundation’s Golden Retriever Lifetime Study is a longitudinal, prospective study following 3044 golden retrievers. The Study’s purpose is to identify the nutritional, environmental, lifestyle and genetic risk factors for cancer and other diseases. The Golden Oldie’s study enrolled an additional cohort of golden retrievers that had reached the age of 12 years or older and had not yet been diagnosed with a malignant cancer. This population can be used as a control group for conditions with high mortality in younger age. This dataset contains the data for ~1.1 million genetic marke...
bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience
The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalography (EEG) recordings from 1,020 comatose patients with a diagnosis of cardiac arrest who were admitted to an intensive care unit from seven academic hospitals in the U.S. and Europe. Patients were monitored with 18 bipolar EEG channels over hours to days for the diagnosis of seizures and for neurological prognostication. Long-term neurological function was determined using the Cerebral Performance Category scale.
aerial imageryagriculturecogearth observationgeospatialimagingmappingnatural resourcesustainability
The State of Indiana Geographic Information Office and IOT Office of Technology manage a series of digital orthophotography dating back to 2005. Every year's worth of imagery is available as Cloud Optimized GeoTIFF (COG) files, original GeoTIFF, and other compressed deliverables such as ECW and MrSID. Additionally, each imagery year is organized into a tile grid scheme covering the entire geography of Indiana. All years of imagery are tiled from a 5,000 ft grid or sub tiles depending upon the resolution of the imagery. The naming of the tiles reflects the lower left coordinate from the...
agricultureearth observationgeospatialimaginglidarmappingnatural resourcesustainability
The State of Indiana Geographic Information Office and IOT Office of Technology manage a series of digital LiDAR LAS files stored in AWS, dating back to the 2011-2013 collection and including the NRCS-funded 2016-2020 collection. These LiDAR datasets are available as uncompressed LAS files, for cloud storage and access. Each year's data is organized into a tile grid scheme covering the entire geography of Indiana, ensuring easy access and efficient processing. The tiles' naming reflects each tile's lower left coordinate, facilitating accurate data management and retrieval. The AWS ...
csvjapanesenatural language processing
Japanese Tokenizer Dictionaries for use with MeCab.
benchmarkbioinformaticslife sciencesmetagenomicsmicrobiome
Database for use with Kraken2 (taxonomic annotation of metagenomic sequencing reads) including all NCBI RefSeq genomes available in release V205
bioinformaticshealthlife sciencesnatural language processingus
MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework. The MIMIC-I...
computed tomographyhealthimaginglife sciencesmagnetic resonance imagingmedicineniftisegmentation
With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validati...
air temperatureatmosphereforecastgeosciencegeospatialmodelnear-surface air temperaturenear-surface relative humiditynetcdfweather
The flagship Numerical Weather Prediction model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. This dataset covers 2 years' worth of historical data, updated regularly with a time delay. The Global deterministic model is a global configuration of the Met Office Unified Model providing the most accurate short-range deterministic forecast by any national meteorological servic...
forecastgeosciencegeospatialglobalmarinemodelnetcdfocean sea surface heightoceansweather
This is the Global Ocean component of the Met Office Global Coupled Atmosphere-Land-Ocean-Ice system, which has been running operationally since May 2022. The system provides a global physical analysis and coupled forecast products, including 3D daily mean fields of temperature and salinity, zonal and meridional velocities; 2D daily mean fields of sea surface height, bottom temperature, mixed layer depth, sea ice fraction, sea ice thickness and sea ice zonal and meridional velocities; and instantaneous hourly fields for sea surface height, sea surface temperature and surface currents. The Met Office Glo...
forecastgeosciencegeospatialglobalmarinemodelnetcdfocean sea surface heightoceansweather
The Met Office runs global wave forecast models to support marine safety and operational decision making. Met Office configurations are developed to be run using the community wave model WAVEWATCH III™. The global wave configuration is designed to generate accurate forecasts for open waters of the world’s oceans and larger seas. The Met Office wave models are forced using wind data from the Met Office Global Atmospheric Hi-Res Model. The global wave model is run to provide a five-day outlook for wave characteristics defining height, period and direction of waves within a given sea-state. The ...
forecastgeosciencegeospatialmarinemodelnetcdfocean sea surface heightoceansweather
The Northwest European continental shelf physical ocean model predicts temperature, salinity and circulation for waters surrounding the UK.
Ocean physics analysis provides a 6-day forecast for the North-West European Atlantic shelf at 1.5km resolution.
forecastgeosciencegeospatialmarinemodelnetcdfocean sea surface heightoceansweather
Northwest European continental shelf regional wave model predicting sea-state and various sea and swell wave characteristics for waters surrounding the UK. The Met Office runs global and regional wave forecast models to support marine safety and operational decision making. Met Office configurations are developed to be run using the community wave model WAVEWATCH III™. The global wave configuration is designed to generate accurate forecasts for open waters of the world's oceans and larger seas, whilst regional configurations are run in order to improve accuracy closer to the coast. The Met...
air temperatureatmosphereforecastgeosciencegeospatialmodelnear-surface air temperaturenear-surface relative humiditynetcdfweather
The flagship Numerical Weather Prediction model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. This dataset covers 2 years' worth of historical data, updated regularly with a time delay. The UK deterministic model is a post-processed, regionally downscaled configuration of the Unified Model, covering the UK and Ireland, with a resolution of approximately 0.018 degrees. The Unit...
computer visionurbanusvideo
The Multiview Extended Video with Activities (MEVA) dataset consists of video data of human activity, both scripted and unscripted, collected with roughly 100 actors over several weeks. The data was collected with 29 cameras with overlapping and non-overlapping fields of view. The current release consists of about 328 hours (516GB, 4259 clips) of video data, as well as 4.6 hours (26GB) of UAV data. Other data includes GPS tracks of actors, camera models, and a site map. We have also released annotations for roughly 184 hours of data. Further updates are planned.
air temperatureclimateclimate modelclimate projectionsCMIP6cogearth observationenvironmentalglobalmodelNASA Center for Climate Simulation (NCCS)near-surface relative humiditynear-surface specific humiditynetcdfprecipitation
The NEX-GDDP-CMIP6 dataset is comprised of global downscaled climate scenarios derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 6 (CMIP6) and across two of the four "Tier 1" greenhouse gas emissions scenarios known as Shared Socioeconomic Pathways (SSPs). The CMIP6 GCM runs were developed in support of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR6). This dataset includes downscaled projections from ScenarioMIP model runs for which daily scenarios were produced and distributed...
archivesastronomydatacenterimagingsatellite imageryx-ray
NASA data for high energy astrophysics (generally x-ray and gamma-ray domains) is made available here by the High Energy Astrophysics Science Archive Research Center. The HEASARC hosts the full data archives of over 30 different missions spanning 50 years. The data archive for each mission will contain a range of data types from spacecraft housekeeping and raw photon event list data up to high level science-ready products such as images, light curves (time series), and energy spectra.
This is a relatively modest total data volume but contains significant complexity and heterogeneity among the different missions. Data provided here are stored in the Flexible Image Transport System (FITS) format common in astronomy. Higher level products are further defined to be consistent between missions following data model standards agreed by the community and maintained by the HEASARC. Analysis of these data may require software also provided by HEASARC, the HEASoft package, consisting of tools generic to all FITS data, generic to all HEASARC-compliant data, and/or specific to individual missions as appropriate. Some missions provide standard science-ready data products, while others provide low-level data types and software to generate science-ready products from them. See the links for each mission for more information on how to use the data.
The HEASARC Website also has archive browsing tools where you can query for observations corresponding to temporal and spatial constraints among others. These tools will ultimately point to files located on the archive by giving a URL beginning with the path https://heasarc.gsfc.nasa.gov/FTP/. The data that are provided in the ODR follow the same structure, so when our tools give an https access URL, a user can simply swap in s3://nasa-heasarc/ for the first part of that URL and get a cloud URI. Note also that some selections have been made to what has been copied to the ODR, while the HEASARC archive itself remains the definitive and legacy source for the complete datasets.
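The prefix swap described above can be sketched in a few lines. This is an illustration only: the function name and the example file path are hypothetical, while the two prefixes are the ones given in the text.

```python
HTTPS_PREFIX = "https://heasarc.gsfc.nasa.gov/FTP/"
S3_PREFIX = "s3://nasa-heasarc/"


def https_to_s3_uri(url):
    """Convert a HEASARC archive URL to the matching cloud URI (hypothetical helper)."""
    if not url.startswith(HTTPS_PREFIX):
        raise ValueError("URL does not start with " + HTTPS_PREFIX)
    # Keep the rest of the path unchanged; only the leading prefix is swapped.
    return S3_PREFIX + url[len(HTTPS_PREFIX):]


if __name__ == "__main__":
    # The path below is a made-up placeholder, not a real archive file.
    example = HTTPS_PREFIX + "example-mission/data/example.fits"
    print(https_to_s3_uri(example))
```

Since only a selection of the archive has been copied, a converted URI may not exist in the bucket; checking for the object before relying on it is a sensible precaution.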
The HEASARC also...
archivesastronomydatacenterimagingsatellite imagery
NASA data for cosmic microwave background (CMB) analysis is made available here by the Legacy Archive for Microwave Background Data Analysis (LAMBDA), which is a part of NASA's High Energy Astrophysics Science Archive Research Center (HEASARC). LAMBDA hosts the data archives of over 30 different CMB missions spanning 30+ years. The data archive for each mission may contain a range of data types from low-level time-ordered data to high level science-ready products such as sky maps and angular power spectra. Also provided in consistent formats are a variety of full sky maps in complementary ...
astronomymachine learningNASA SMD AI
The SOHO/LASCO data set (prepared for the challenge hosted in Topcoder) provided here comes from the instrument’s C2 telescope and comprises approximately 36,000 images spread across 2,950 comet observations. The human eye is a very sensitive tool and it is the only tool currently used to reliably detect new comets in SOHO data - particularly comets that are very faint and embedded in the instrument background noise. Bright comets can be easily detected in the LASCO data by relatively simple automated algorithms, but the majority of comets observed by the instrument are extremely faint, noise-...
bioinformaticsbiologyGeneLabgenomicimaginglife sciencesspace biology
NASA’s Space Biology Open Science Data Repository (OSDR) introduces a one-stop site where users can explore and contribute a variety of NASA open science biological data. This site consolidates data from the Ames Life Sciences Data Archive (ALSDA) and GeneLab and includes information about the broader NASA Open Science and Open Data initiatives, all at one centralized location. Our mission is to maximize the utilization of the valuable biological research resources and enable new discoveries.
OSDR introduces access to data generated from spaceflight and space relevant experiments that explore ...
analyticsanomaly detectionarchivescomputed tomographydatacenterdigital assetselectricityenergyfluid dynamicsimage processingphysicspost-processingradiationsignal processingsource codeturbulencevideox-rayx-ray tomography
The Large Helical Device (LHD), owned and operated by the National Institute for Fusion Science (NIFS), is one of the world's largest plasma confinement devices, employing a heliotron magnetic configuration generated by superconducting coils. The objectives are to conduct academic research on the confinement of steady-state, high-temperature, high-density plasmas, core plasma physics, and fusion reactor engineering, which are necessary to develop future fusion reactors. All the archived data of the LHD plasma diagnostics are available since the beginning of the LHD experiment, starte...
climateenvironmentalmeteorologicaloceanssustainabilityweather
This dataset includes hourly sea surface temperature and current data collected by satellite-tracked surface drifting buoys ("drifters") of the NOAA Global Drifter Program. The Drifter Data Assembly Center (DAC) at NOAA’s Atlantic Oceanographic and Meteorological Laboratory (AOML) has applied quality control procedures and processing to edit these observational data and obtain estimates at regular hourly intervals. The data include positions (latitude and longitude), sea surface temperatures (total, diurnal, and non-diurnal components) and velocities (eastward, northward) with accompanying uncertainty estimates. Metadata include identification numbers, experiment number, start location and time, end location and time, drogue loss date, death code, manufacturer, and drifter type.
Please note that data from the Global Drifter Program are also available at 6-hourly intervals but derived via alternative methods. The 6-hourly dataset goes back further in time (1979) and may be more appropriate for studies of long-term, low-frequency patterns of the oceanic circulation. However, the 6-hourly dataset does not fully resolve high-frequency processes such as tides and inertial oscillations, or the diurnal variability of sea surface temperature.
[CITING NOAA - hourly position, current, and sea surface temperature from drifters data. Citation for this dataset should include the following information below.]
Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; ...
aerial imageryclimatecogdisaster responseweather
In order to support NOAA's homeland security and emergency response requirements, the National Geodetic Survey Remote Sensing Division (NGS/RSD) has the capability to acquire and rapidly disseminate a variety of spatially-referenced datasets to federal, state, and local government agencies, as well as the general public. Remote sensing technologies used for these projects have included lidar, high-resolution digital cameras, a film-based RC-30 aerial camera system, and hyperspectral imagers. Examples of rapid response initiatives include acquiring high resolution images with the Emerge/App...
agricultureclimatemeteorologicalweather
NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these ex...
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
NOTE - Upgrade NCEP Global Forecast System to v16.3.0 - Effective November 29, 2022
The Global Forecast System (GFS) is a weather forecast model produced
by the National Centers for Environmental Prediction (NCEP). Dozens of
atmospheric and land-soil variables are available through this dataset,
from temperatures, winds, and precipitation to soil moisture and
atmospheric ozone concentration. The entire globe is covered by the GFS
at a base horizontal resolution of 18 miles (28 kilometers) between grid
points, which is used by the operational forecasters who predict weather
out to 16 days in the future. Horizontal resolution drops to 44 miles
(70 kilometers) between grid points for forecasts between one week and two
weeks.
The NOAA Global Forecast System (GFS) Warm Start Initial Conditions are
produced by the National Centers for Environmental Prediction (NCEP)
to run operational deterministic medium-range numerical weather predictions.
The GFS is built with the GFDL Finite-Volume Cubed-Sphere Dynamical Core (FV3)
and the Grid-Point Statistical Interpolation (GSI) data assimilation system.
Please visit the links below in the Documentation section to find more details
about the model and the data...
computer forensicscomputer securitycyber securitydigital forensicsmalwaremixed file datasetransomware
NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set and it is intended that ‘NapierOne’ be used as a complement to this original data set. An investigation was performed with the goal of determining the common...
cancerdigital pathologyfluorescence imagingimage processingimaginglife sciencesmachine learningmicroscopyradiology
Imaging Data Commons (IDC) is a repository within the Cancer Research Data Commons (CRDC) that manages imaging data and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections that are contributed either by funded US National Cancer Institute (NCI) data collection activities or by individual researchers. Image data hosted by IDC is stored in DICOM format.
climate projectionsCMIP5CMIP6earth observationenergygeospatialmeteorologicalsolar
The National Climate Database (NCDB) seeks to be the definitive source of climate data for energy applications. The goal of the NCDB is to provide unbiased high temporal and spatial resolution climate data needed for renewable energy modeling. The NCDB seeks to maintain the inherent relationship between the various parameters that are needed to model solar, wind, hydrology and load and provide data for multiple important climate scenarios.
agriculturebiodiversitybiologyclimatedigital preservationecosystemsenvironmental
The National Herbarium of New South Wales is one of the most significant scientific, cultural and historical botanical resources in the Southern Hemisphere. The 1.43 million preserved plant specimens have been captured as high-resolution images, and the biodiversity metadata associated with each of the images has been captured in digital form. Botanical specimens date from 1770 to today, and form voucher collections that document the distribution and diversity of the world's flora through time, particularly that of NSW, Australia and the Pacific. The data is used in biodiversity assessment, syste...
citieseventsgeospatial
Open City Model is an initiative to provide CityGML data for all the buildings in the United States. By using other open datasets in conjunction with our own code and algorithms, our goal is to provide 3D geometries for every US building.
archivesastronomyatmospheregloballife sciencesopen source softwaresignal processing
This platform is maintained by CRAAM (Mackenzie Radio Astronomy and Astrophysics Center), a research center operated by UPM (Mackenzie Presbyterian University) and INPE (National Institute for Space Research), to provide public and free access for researchers, students, and the interested public to VLF (Very Low Frequency) data from CRAAM's antenna systems. Amazon AWS supports all data stored through the AWS Open Data Program. Very Low Frequency (VLF) signals can be used for navigation services, communication with submarines, and are a powerful tool to study the low-altitude Earth's io...
alphafoldlife sciencesmsaopen source softwareopenfoldproteinprotein foldingprotein template
Multiple sequence alignments (MSAs) for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. Template hits are also provided for the PDB chains and 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth. MSAs were generated with HHBlits (-n3) and JackHMMER against MGnify, BFD, UniRef90, and UniClust30 while templates were identified from PDB70 with HHSearch, all according to procedures outlined in the supplement to the AlphaFold 2 Nature paper, Jumper et al. 2021. We expect the database to be broadly useful to structural biologists training or valid...
artdeep learningimage processinglabeledmachine learningmedia
PD12M is a collection of 12.4 million CC0/PD image-caption pairs for the purpose of training generative image models.
autonomous vehiclescomputer visionlidarmarine navigationrobotics
This dataset presents a multi-modal maritime dataset acquired in restricted waters in Pohang, South Korea. The sensor suite is composed of three LiDARs (one 64-channel LiDAR and two 32-channel LiDARs), a marine radar, two visual cameras used as a stereo camera, an infrared camera, an omnidirectional camera with 6 directions, an AHRS, and a GPS with RTK. The dataset includes the sensor calibration parameters and SLAM-based baseline trajectory. It was acquired while navigating a 7.5 km route that includes a narrow canal area, inner and outer port areas, and a near-coastal area. The aim of this d...
bioinformaticsbiologyecosystemsenvironmentalgeneticgenomichealthlife sciencesmetagenomicsmicrobiome
QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.
bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience
The IIIC dataset includes 50,697 labeled EEG samples from 2,711 patients and 6,095 EEGs that were annotated by physician experts from 18 institutions. These samples were used to train SPaRCNet (Seizures, Periodic and Rhythmic Continuum patterns Deep Neural Network), a computer program that classifies IIIC events with an accuracy matching clinical experts.
computed tomographycomputer visioncoronavirusCOVID-19grand-challenge.orgimaginglife sciencesSARS-CoV-2
The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on grand-challenge.org.
cogearth observationgeospatialnatural resourcesatellite imagerywater
Aquatic reflectance produced with the dark spectrum fitting (DSF) algorithm as implemented in the Atmospheric Correction for OLI “lite” (ACOLITE) software (version 20221114.0). Aquatic reflectance is defined here as unitless water-leaving radiance reflectance and represents the ratio of water-leaving radiance (units of watts per square meter per steradian per nanometer) to downwelling irradiance (units of watts per square meter per nanometer) multiplied by pi.
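The definition above can be written as a one-line calculation. The following is a minimal sketch of that ratio only, not of the DSF algorithm or the ACOLITE software; the function and argument names are hypothetical and chosen for illustration.

```python
import math


def aquatic_reflectance(water_leaving_radiance, downwelling_irradiance):
    """Return unitless aquatic reflectance: pi * L_w / E_d.

    water_leaving_radiance: W m^-2 sr^-1 nm^-1
    downwelling_irradiance: W m^-2 nm^-1
    Names are illustrative assumptions, not ACOLITE identifiers.
    """
    if downwelling_irradiance <= 0:
        raise ValueError("downwelling irradiance must be positive")
    return math.pi * water_leaving_radiance / downwelling_irradiance


# Illustrative input values only (not taken from the dataset).
rho_w = aquatic_reflectance(0.005, 1.0)
print(rho_w)
```

Multiplying by pi (steradians) cancels the per-steradian unit of the radiance, which is why the result is unitless.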
digital preservationfree softwareopen source softwaresource code
Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Deb...
cyber securitydeep learninglabeledmachine learning
A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the sam...
aerial imagerygeospatialimagingmapping
The State of Colorado has gathered public historical imagery ranging from 2005 to 2021.
astronomy
TESS-Gaia Light Curve (TGLC) is a PSF-based TESS full-frame image (FFI) light curve product. Using Gaia DR3 as priors, the team forward models the FFIs with the effective point spread function to remove contamination from nearby stars. The resulting light curves show a photometric precision closely tracking the pre-launch prediction of the noise level: TGLC's photometric precision consistently reaches ≲2% at 16th TESS magnitude even in crowded fields, demonstrating excellent decontamination and deblending power.
amino acidfastafastqgeneticgenomiclife sciencesmetagenomicsmicrobiome
The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...
genome wide association studygenomiclife scienceslofteevep
VEP determines the effect of genetic variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. The European Bioinformatics Institute produces the VEP tool/db and releases updates every 1 to 6 months. The latest release contains 267 genomes from 232 species containing 5,567,663 protein coding genes. This dataset hosts the last 5 releases for human, rat, and zebrafish. It also hosts the required reference files for the Loss-Of-Function Transcript Effect Estimator (LOFTEE) plugin, as it is commonly used with VEP.
atmosphereclimateearth observationforecastgeosciencehydrologymeteorologicalmodeloceansweather
Global real-time Earth system data deemed by the World Meteorological Organisation (WMO) as essential for provision of services for the protection of life and property and for the well-being of all nations. Data is sourced from all WMO Member countries / territories and retained for 24 hours. The Met Office and NOAA operate this Global Cache service, curating and publishing the dataset on behalf of WMO.
agricultureclimateclimate modelclimate projectionsdisaster responseelectricityenergyenvironmentalgeospatialmeteorologicalsolarsustainabilityweather
Wildfire projections for California and its environs in support of California's Fifth Climate Assessment, accompanied by historical weather observations and renewable energy capacity profiles for grid operations.
benchmarkenergymachine learning
This data lake contains multiple datasets related to fundamental problems in wind energy research. This includes data for wind plant power production for various layouts/wind flow scenarios, data for two- and three-dimensional flow around different wind turbine airfoils/blades, and wind turbine noise production, among others. The purpose of these datasets is to establish a standard benchmark against which new AI/ML methods can be tested, compared, and deployed. Details regarding the generation and formatting of the data for each dataset are included in the metadata as well as example noteboo...
fastqgeneticgenomiclife scienceswhole genome sequencing
The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
1940 censusarchivescensusdemographynara
The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, although some persons were missed. The 1940 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 2, 2012. The 1940 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The coverage is nationwide and inclu...
1950 censusarchivescensusdemographynara
The 1950 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1950, although some persons were missed. The 1950 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 1, 2022. The 1950 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The coverage is nationwide and inclu...
censusdifferential privacydisclosure avoidanceethnicitygroup quartershispanichousinghousing unitslatinonoisy measurementspopulationraceredistrictingvoting age
The 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (2023-04-03) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022] https://doi.org/10.1162/99608f92.529e3cb9, and implemented in https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code). The NMF was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations, that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File.
The NMF consists of the full set of privacy-protected statistical queries (counts of individuals or housing units with particular combinations of characteristics) of confidential 2010 Census data relating to the redistricting data portion of the 2010 Demonstration Data Products Suite – Redistricting and Demographic and Housing Characteristics File – Production Settings (2023-04-03). These statistical queries, called “noisy measurements,” were produced under the zero-Concentrated Differential Privacy framework (Bun, M. and Steinke, T [2016] https://arxiv.org/abs/1605.02065; see also Dwork C. and Roth, A. [2014] https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) implemented via the discrete Gaussian mechanism (Canonne C., et al., [2023] https://arxiv.org/abs/2004.00010), which added positive or negative integer-valued noise to each of the resulting counts. The noisy measurements are an intermediate stage of the TDA prior to the post-processing the TDA then performs to ensure internal and hierarchical consistency within the resulting tables. The Census Bureau has released these 2010 Census demonstration data to enable data users to evaluate the expected impact of disclosure avoidance variability on 2020 Census data. The 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (2023-04-03) has been cleared for public dissemination by the Census Bureau Disclosure Review Board (CBDRB-FY22-DSEP-004).
The data includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2010 Census Edited File (CEF), which includes confidential data initially collected in the 2010 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) (https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/04-Demonstration_Data_Products_Suite/2023-04-03/). As these 2010 Census demonstration data are intended to support study of the design and expected impacts of the 2020 Disclosure Avoidance System, the 2010 CEF records were pre-processed before application of the zCDP framework. This pre-processing converted the 2010 CEF records into the input-file format, response codes, and tabulation categories used for the 2020 Census, which differ in substantive ways from the format, response codes, and tabulation categories originally used for the 2010 Census.
The NMF provides estimates of counts of...
censusdifferential privacydisclosure avoidanceethnicitygroup quartershousinghousing unitsnoisy measurementspopulationraceredistrictingvoting age
The 2020 Census Redistricting Data (P.L. 94-171) Noisy Measurement File (NMF) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022] https://doi.org/10.1162/99608f92.529e3cb9, and implemented in the DAS 2020 Redistricting Production Code). The NMF was generated using the Census Bureau's implementation of the Discrete Gaussian Mechanism, calibrated to satisfy zero-Concentrated Differential Privacy with bounded neighbors.
The NMF values, called noisy measurements, are the output of applying the Discrete Gaussian Mechanism to counts from the 2020 Census Edited File (CEF). They are generally inconsistent with one another (for example, in a county composed of two tracts, the noisy measurement for the county's total population may not equal the sum of the noisy measurements of the two tracts' total population), and frequently negative (especially when the population being measured was small), but are integer-valued. The NMF was later post-processed as part of the DAS code to take the form of microdata and to satisfy various constraints. The NMF documented here contains both the noisy measurements themselves as well as the data needed to represent the DAS constraints; thus, the NMF could be used to reproduce the steps taken by the DAS code to produce microdata from the noisy measurements by applying the production code base.
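The properties described above (integer-valued, possibly negative, and not additive across geographies) can be illustrated with a toy example. This is a minimal sketch, not the Census Bureau's DAS code: it approximates a discrete Gaussian by sampling from a truncated support, and the counts, the noise scale `sigma`, and all function names are hypothetical.

```python
import math
import random


def sample_discrete_gaussian(sigma, rng, tail=12):
    """Draw one integer k with probability proportional to exp(-k^2 / (2 sigma^2)).

    ASSUMPTION: the support is truncated at +/- tail * sigma, which is an
    approximation for illustration, not an exact production sampler.
    """
    bound = max(1, math.ceil(tail * sigma))
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]


def noisy_measurement(true_count, sigma, rng):
    """Add independent integer noise to a single confidential count."""
    return true_count + sample_discrete_gaussian(sigma, rng)


rng = random.Random(0)
sigma = 10.0  # hypothetical noise scale

# Hypothetical county made of two tracts, one with a small population.
tract_counts = {"tract_a": 3, "tract_b": 1200}
county_count = sum(tract_counts.values())

noisy_tracts = {name: noisy_measurement(count, sigma, rng)
                for name, count in tract_counts.items()}
noisy_county = noisy_measurement(county_count, sigma, rng)

print("noisy tracts:", noisy_tracts)
print("sum of noisy tracts:", sum(noisy_tracts.values()))
print("noisy county:", noisy_county)
```

Because each count receives its own independent noise draw, the noisy county total will usually differ from the sum of the noisy tract totals, and a small count such as `tract_a` can easily come out negative. Restoring consistency is the job of the later post-processing step.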
The 2020 Census Redistricting Data (P.L. 94-171) Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2020 Census Edited File (CEF), which includes confidential data initially collected in the 2020 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File.
The NMF ...
bioinformaticsbiologygeneticgenomicimaginglife sciences
The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...
floodsglobalnear-surface air temperaturenear-surface specific humiditynetcdfprecipitation
Hydrological extremes, in the form of droughts and floods, have impacts on a wide range of sectors including water availability, food security, and energy production, among others. Given continuing large impacts of droughts and floods and the expectation for significant regional changes projected in the future, there is an urgent need to provide estimates of past events and their future risk, globally. However, current estimates of hydrological extremes are not robust and accurate enough, due to lack of long-term data records, standardized methods for event identification, geographical inconsi...
blockchainweb3
The AWS Public Blockchain Data initiative provides free access to blockchain datasets through collaboration with data providers. The data is optimized for analytics by being transformed into compressed Parquet files, partitioned by date for efficient querying.
| Blockchain dataset | Maintained by | Path |
|---|---|---|
| Bitcoin | AWS | s3://aws-public-blockchain/v1.0/btc/ |
| Ethereum | AWS | s3://aws-public-blockchain/v1.0/eth/ |
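Since the files are partitioned by date, a query for one day only needs the prefix for that partition. The sketch below builds such a prefix from the base paths in the table. The table name `blocks` and the `date=YYYY-MM-DD` directory layout are assumptions made for illustration and should be checked against the dataset documentation.

```python
from datetime import date

# Base paths copied from the table above.
BASE_PATHS = {
    "btc": "s3://aws-public-blockchain/v1.0/btc/",
    "eth": "s3://aws-public-blockchain/v1.0/eth/",
}


def partition_prefix(chain, table, day):
    """Build the S3 prefix for one day's partition of one table.

    ASSUMPTION: the layout <base><table>/date=YYYY-MM-DD/ and the table
    names are hypothetical, not confirmed by the text above.
    """
    return f"{BASE_PATHS[chain]}{table}/date={day.isoformat()}/"


prefix = partition_prefix("btc", "blocks", date(2023, 1, 1))
print(prefix)
```

A prefix like this could then be handed to a Parquet reader that understands S3 URLs, so that only the files for the requested day are scanned.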
SocialGene RefSeq Databases
amino acidbioinformaticschemical biologygenomicgraphmetagenomicsmicrobiomepharmaceuticalprotein
Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.