The Registry of Open Data on AWS is now available on AWS Data Exchange
All datasets on the Registry of Open Data are now discoverable on AWS Data Exchange alongside 3,000+ existing data products from category-leading data providers across industries. Explore the catalog to find open, free, and commercial datasets. Learn more about AWS Data Exchange.

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

Get started using data quickly by viewing all tutorials with associated SageMaker Studio Lab notebooks.

See all usage examples for datasets listed in this registry.

See datasets from Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Data for Good at Meta, NASA Space Act Agreement, NIH STRIDES, NOAA Open Data Dissemination Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.


Search datasets (currently 95 matching datasets)


Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.


Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

The Cancer Genome Atlas

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Details →

Usage examples

See 29 usage examples →

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...

Details →

Usage examples

See 24 usage examples →

Sentinel-2

agriculture, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac

The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth's land surface every 5 days, making the data of great use in ongoing studies. L1C data are available globally from June 2015. L2A data are available from November 2016 over the Europe region and globally since January 2017.

Details →

Usage examples

See 23 usage examples →

The Singapore Nanopore Expression Data Set

bam, bioinformatics, fast5, fasta, fastq, genomic, life sciences, long read sequencing, short read sequencing, transcriptomics

The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...

Details →

Usage examples

See 15 usage examples →

Low Altitude Disaster Imagery (LADI) Dataset

aerial imagery, coastal, computer vision, disaster response, earth observation, earthquakes, geospatial, image processing, imaging, infrastructure, land, machine learning, mapping, natural resource, seismology, transportation, urban, water

The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2023. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.

Details →

Usage examples

See 11 usage examples →

Catalina Sky Survey (CSS) subset data on AWS

astronomy, object detection, planetary, survey

Raw data used to discover Near Earth Objects (NEOs) that could potentially impact Earth.

Details →

Usage examples

See 9 usage examples →

Cancer Cell Line Encyclopedia (CCLE)

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES, transcriptomics, whole genome sequencing

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.

Details →

Usage examples

See 8 usage examples →

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.

Details →

Usage examples

  • Cancer Genomics Cloud by Seven Bridges
  • Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer by Hui Zhang, Tao Liu, Zhen Zhang, Samuel H. Payne, Bai Zhang, Jason E. McDermott, Jian-Ying Zhou, Vladislav A. Petyuk, Li Chen, Debjit Ray, Shisheng Sun, Feng Yang, Lijun Chen, Jing Wang, Punit Shah, Seong Won Cha, Paul Aiyetan, Sunghee Woo, Yuan Tian, Marina A. Gritsenko, Therese R. Clauss, Caitlin Choi, Matthew E. Monroe, Stefani Thomas, Song Nie, Chaochao Wu, Ronald J. Moore, Kun-Hsing Yu, David L. Tabb, David Fenyö, Vineet Bafna, Yue Wang, Henry Rodriguez, Emily S. Boja, Tara Hiltke, Robert C. Rivers, Lori Sokoll, Heng Zhu, Ie-Ming Shih, Leslie Cope, Akhilesh Pandey, Bing Zhang, Michael P. Snyder, Douglas A. Levine, Richard D. Smith, Daniel W. Chan, Karin D. Rodland, the CPTAC Investigators
  • Genomic Data Commons by National Cancer Institute
  • Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities by Suhas Vasaikar, Chen Huang, Xiaojing Wang, Vladislav A. Petyuk, Sara R. Savage, Bo Wen, Yongchao Dou, Yun Zhang, Zhiao Shi, Osama A. Arshad, Marina A. Gritsenko, Lisa J. Zimmerman, Jason E. McDermott, Therese R. Clauss, Ronald J. Moore, Rui Zhao, Matthew E. Monroe, Yi-Ting Wang, Matthew C. Chambers, Robbert J.C. Slebos, Ken S. Lau, Qianxing Mo, Li Ding, Matthew Ellis, Mathangi Thiagarajan, Christopher R. Kinsinger, Henry Rodriguez, Richard D. Smith, Karin D. Rodland, Daniel C. Liebler, Tao Liu, Bing Zhang, Clinical Proteomic Tumor Analysis Consortium
  • Proteomic analysis of colon and rectal carcinoma using standard and customized databases by Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC

See 7 usage examples →

BossDB Open Neuroimagery Datasets

calcium imaging, electron microscopy, imaging, life sciences, light-sheet microscopy, magnetic resonance imaging, neuroimaging, neuroscience, volumetric imaging, x-ray, x-ray microtomography, x-ray tomography

This data ecosystem, Brain Observatory Storage Service & Database (BossDB), contains several neuro-imaging datasets across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include dense segmentation and meshes.

Details →

Usage examples

See 6 usage examples →

Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.

Details →

Usage examples

See 6 usage examples →

New York City Taxi and Limousine Commission (TLC) Trip Record Data

cities, transportation, urban

Data on trips taken by taxis and for-hire vehicles in New York City. Note: access to this dataset is free; however, direct S3 access requires an AWS account. Anonymous downloads are accessible from the dataset's documentation webpage listed below.

Details →

Usage examples

See 6 usage examples →

The MIT Supercloud Dataset

cloud computing, datacenter, energy, HPC, workload analysis

Collection of parsed datacenter logs and time series data of hardware utilization from the MIT Supercloud system.

Details →

Usage examples

See 6 usage examples →

CAncer MEtastases in LYmph nOdes challeNge (CAMELYON) Dataset

cancer, computational pathology, computer vision, deep learning, grand-challenge.org, histopathology, life sciences

"This dataset contains the all data for the CAncer MEtastases in LYmph nOdes challeNge or CAMELYON. CAMELYON was the first challenge using whole-slide images in computational pathology and aimed to help pathologists identify breast cancer metastases in sentinel lymph nodes. Lymph node metastases are extremely important to find, as they indicate that the cancer is no longer localized and systemic treatment might be warranted. Searching for these metastases in H&E-stained tissue is difficult and time-consuming and AI algorithms can play a role in helping make this faster and more accura...

Details →

Usage examples

See 5 usage examples →

CoMMpass from the Multiple Myeloma Research Foundation

cancer, genetic, genomic, life sciences, STRIDES, whole genome sequencing

The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...

Details →

Usage examples

See 5 usage examples →

KyFromAbove on AWS

aerial imagery, earth observation, elevation, geospatial, lidar

The KyFromAbove initiative is focused on building and maintaining a current basemap for Kentucky that can meet the needs of its users at the state, federal, local, and regional levels. A common basemap, including current color leaf-off aerial photography and elevation data (LiDAR), reduces the cost of developing GIS applications, promotes data sharing, and adds efficiencies to many business processes. All basemap data acquired through this effort is being made available in the public domain. KyFromAbove acquires aerial imagery and LiDAR during leaf-off conditions in the Commonwealth. The imagery...

Details →

Usage examples

See 5 usage examples →

MONKEY

cancer, classification, computational pathology, computer vision, deep learning, digital pathology, grand-challenge.org, histopathology, imaging, life sciences, machine learning, medical image computing, medical imaging

This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...

Details →

Usage examples

See 5 usage examples →

Normalized Difference Urban Index (NDUI)

earth observation, geospatial, satellite imagery, urban

NDUI is combined with cloud shadow-free Landsat Normalized Difference Vegetation Index (NDVI) composite and DMSP/OLS Night Time Light (NTL) to characterize global urban areas at a 30 m resolution, and it can greatly enhance urban areas, which can then be easily distinguished from bare lands including fallows and deserts. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI has the potential for urbanization studies at regional and global scales.

Details →

Usage examples

See 5 usage examples →

SPARTAN Data

air quality, environmental

SPARTAN (Surface PARTiculate mAtter Network) measures and provides surface ambient particulate matter (PM2.5 and PM10) concentration and the chemical composition around the world, with the purpose of connecting ground-based PM2.5 and satellite remote sensing.

Details →

Usage examples

See 5 usage examples →

Chalmers Cloud Ice Climatology

atmosphere, climate, deep learning, environmental, exploration, geophysics, geoscience, geospatial, global, ice, planetary, satellite imagery, zarr

The Chalmers Cloud Ice Climatology (CCIC) is a novel, deep-learning-based climate record of ice-particle concentrations in the atmosphere. CCIC results are available at high spatial and temporal resolution (0.07° / 3 h from 1983, 0.036° / 30 min from 2000) and thus ideally suited for evaluating high-resolution weather and climate models or studying individual weather systems.

Details →

Usage examples

See 4 usage examples →

Molecular Profiling to Predict Response to Treatment (phs001965)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Molecular Profiling to Predict Response to Treatment (MP2PRT) program is part of the NCI's Cancer Moonshot Initiative. The aim of this program is the retrospective characterization and analysis of biospecimens collected from completed NCI-sponsored trials of the National Clinical Trials Network and the NCI Community Oncology Research Program. This study, titled "Identification of Genetic Changes Associated with Relapse and/or Adaptive Resistance in Patients Registered as Favorable Histology Wilms Tumor on AREN03B2", performs genomic characterization (WGS 30X, Total RNAseq, mi...

Details →

Usage examples

See 4 usage examples →

Refgenie reference genome assets

bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing

Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.

Details →

Usage examples

See 4 usage examples →

Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1

climate, earth observation, environmental, geospatial, global, oceans

Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic, meteorological, and climatological applications, such as evaluating or constraining environmental models or case studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.

Details →

Usage examples

See 4 usage examples →
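To give a sense of scale for the 0.05° latitude-longitude grid mentioned in the description, the following is a minimal sketch of the arithmetic. It assumes a regular grid that covers the full globe (180° of latitude and 360° of longitude), which the description does not state explicitly; `grid_shape` is a hypothetical helper name, not part of any product tooling.

```python
def grid_shape(resolution_deg: float) -> tuple:
    """Return (n_lat, n_lon) for a regular global grid.

    Assumption: the grid spans 180 degrees of latitude and 360 degrees of
    longitude with uniform cell size. The product's actual layout may differ.
    """
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat, n_lon


n_lat, n_lon = grid_shape(0.05)
print(n_lat, n_lon, n_lat * n_lon)  # 3600 7200 25920000
```

Under that assumption, each daily field would hold roughly 26 million cells per variable, which is worth keeping in mind when planning downloads or subsetting.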

Sentinel-1

agriculture, cog, disaster response, earth observation, geospatial, satellite imagery, synthetic aperture radar

Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it perfect for sea and land monitoring, emergency response due to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from beginning to the present, converted to cloud-optimized GeoTIFF format.

Details →

Usage examples

See 4 usage examples →

Sentinel-2 L2A 120m Mosaic

agriculture, cog, earth observation, geospatial, machine learning, natural resource, satellite imagery

Sentinel-2 L2A 120m mosaic is a derived product that contains best pixel values for 10-daily periods, modelled by removing the cloudy pixels and then performing interpolation among remaining values. Because some parts of the world have lengthy cloudy periods, clouds might remain in some areas. The actual modelling script is available here.

Details →

Usage examples

See 4 usage examples →

Yale-CMU-Berkeley (YCB) Object and Model Set

robotics

This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state of the art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 ...

Details →

Usage examples

See 4 usage examples →

iSDAsoil

agriculture, analytics, biodiversity, conservation, deep learning, food security, geospatial, machine learning, satellite imagery

iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 cm and 20-50 cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.

Details →

Usage examples

See 4 usage examples →

AG-LOAM Dataset

agriculture, lidar, localization, mapping, robotics

AG-LOAM dataset has been released to facilitate the evaluation of LiDAR-based odometry algorithms in agricultural environments.

  1. It was collected by a wheeled mobile robot at the Agricultural Experimental Station of the University of California, Riverside, during Winter 2022 and Winter 2023.
  2. It provides LiDAR point cloud data captured using a Velodyne VLP-16 sensor, along with ground-truth trajectories obtained from an RTK-GPS system.
  3. It consists of 18 sequences collected over three phases, covering diverse planting environments, terrain conditions, path patterns, and robot motion profiles.
  4. It ...

Details →

Usage examples

See 3 usage examples →

Beat Acute Myeloid Leukemia (AML) 1.0

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES

Beat AML 1.0 is a collaborative research program involving 11 academic medical centers who worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples offering genomic, clinical, and drug response. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...

Details →

Usage examples

See 3 usage examples →

COBRA

cancer, computational pathology, computer vision, deep learning, histopathology, life sciences

This page describes the COBRA (Classification Of Basal cell carcinoma, Risky skin cancers and Abnormalities) skin pathology dataset, which comprises over 7000 histopathology whole-slide-images related to the diagnosis of basal cell carcinoma skin cancer, the most commonly diagnosed cancer. The dataset includes biopsies and excisions and is divided into four groups. The first group contains about 2,500 BCC biopsies with subtype labels, while the second group includes 2,500 non-BCC biopsies with different types of skin dysplasia. The third group has 1,000 labelled risky cancer biopsies, includin...

Details →

Usage examples

See 3 usage examples →

CitrusFarm Dataset

agriculture, computer vision, IMU, lidar, localization, mapping, robotics

CitrusFarm is a multimodal agricultural robotics dataset that provides both multispectral images and navigational sensor data for localization, mapping and crop monitoring tasks.

  1. It was collected by a wheeled mobile robot in the Agricultural Experimental Station at the University of California Riverside in the summer of 2023.
  2. It offers a total of nine sensing modalities, including stereo RGB, depth, monochrome, near-infrared and thermal images, as well as wheel odometry, LiDAR, IMU and GPS-RTK data.
  3. It comprises seven sequences collected from three citrus tree fields, featuring various tree species at different growth stages, distinctive planting patterns, as well as varying daylight conditions.
  4. It spans a total operation time of 1.7 hours, covers a total distance of 7.5 km, and consti...

Details →

Usage examples

See 3 usage examples →

Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma

cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.

Details →

Usage examples

  • Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., Calvin A. Johnson, Ph.D., James D. Phelan, Ph.D., James Q. Wang, Ph.D., Sandrine Roulland, Ph.D., Monica Kasbekar, Ph.D., Ryan M. Young, Ph.D., Arthur L. Shaffer, Ph.D., Daniel J. Hodson, M.D., Ph.D., Wenming Xiao, Ph.D., et al.
  • Genomic Data Commons by National Cancer Institute
  • A multiprotein supercomplex controlling oncogenic signalling in lymphoma by Phelan JD, Young RM, Webster DE, Roulland S, Wright GW, Kasbekar M, Shaffer AL 3rd, Ceribelli M, Wang JQ, Schmitz R, Nakagawa M, Bachy E, Huang DW, Ji Y, Chen L, Yang Y, Zhao H, Yu X, Xu W, Palisoc MM, Valadez RR, Davies-Hill T, Wilson WH, Chan WC, Jaffe ES, Gascoyne RD, Campo E, Rosenwald A, Ott G, Delabie J, Rimsza LM, Rodriguez FJ, Estephan F, Holdhoff M, Kruhlak MJ, Hewitt SM, Thomas CJ, Pittaluga S, Oellerich T, Staudt LM

See 3 usage examples →

Exceptional Responders Initiative

cancer, epigenomics, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 mon...

Details →

Usage examples

See 3 usage examples →

Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)

cancer, genomic, life sciences

The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

Details →

Usage examples

See 3 usage examples →

MIMIC-III (‘Medical Information Mart for Intensive Care’)

bioinformatics, health, life sciences, natural language processing, us

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework. The MIMIC-I...

Details →

Usage examples

See 3 usage examples →

NapierOne Mixed File Dataset

computer forensics, computer security, cyber security, digital forensics, malware, mixed file dataset, ransomware

NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set and it is intended that ‘NapierOne’ be used as a complement to this original data set. An investigation was performed with the goal of determining the common...

Details →

Usage examples

See 3 usage examples →

Open VLF: Scientific Open Data Initiative for CRAAM's SAVNET and AWESOME VLF Data.

archives, astronomy, atmosphere, global, life sciences, open source software, signal processing

This platform is maintained by CRAAM (Mackenzie Radio Astronomy and Astrophysics Center), a research center operated by UPM (Mackenzie Presbyterian University) and INPE (National Institute for Space Research), to provide public and free access for researchers, students, and the interested public to VLF (Very Low Frequency) data from CRAAM's antenna systems. Amazon AWS supports all data stored through the AWS Open Data Program. Very Low Frequency (VLF) signals can be used for navigation services, communication with submarines, and are a powerful tool to study the low-altitude Earth's io...

Details →

Usage examples

See 3 usage examples →

STOIC2021 Training

computed tomography, computer vision, coronavirus, COVID-19, grand-challenge.org, imaging, life sciences, SARS-CoV-2

The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on grand-challenge.org.

Details →

Usage examples

See 3 usage examples →

Biodiversity Heritage Library Metadata and Page Images

biodiversity, bioinformatics, life sciences

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access.

Details →

Usage examples

See 5 usage examples →

Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer

cancer, genomic, life sciences, STRIDES, transcriptomics

The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Details →

Usage examples

See 2 usage examples →

Cloud Indexes for Bowtie, Kraken, HISAT, and Centrifuge

bioinformatics, biology, genomic, life sciences, mapping, medicine, reference index, whole genome sequencing

Genomic tools use reference databases as indexes to operate quickly and efficiently, analogous to how web search engines use indexes for fast querying. Here, we aggregate genomic, pan-genomic and metagenomic indexes for analysis of sequencing data.

Details →

Usage examples

See 2 usage examples →

Copernicus Digital Elevation Model (DEM)

agriculture, cog, disaster response, earth observation, elevation, geospatial, satellite imagery

The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. We provide two instances of Copernicus DEM named GLO-30 Public and GLO-90. GLO-90 provides worldwide coverage at 90 meters. GLO-30 Public provides limited worldwide coverage at 30 meters because a small subset of tiles covering specific countries are not yet released to the public by the Copernicus Programme. Note that in both cases ocean areas do not have tiles; height values there can be assumed to be zero. Data is provided as Cloud Optimized Ge...

Details →

Usage examples

See 2 usage examples →
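The note that ocean areas have no tiles and can be treated as zero height suggests a simple fallback when looking up elevations. The sketch below illustrates that idea only. It assumes, for illustration, that tiles are indexed by the integer latitude and longitude of their south-west corner, and it stands in a single float per tile for real raster data; `tile_key` and `elevation_at` are hypothetical names, not part of any Copernicus or AWS tooling.

```python
import math
from typing import Dict, Tuple

TileKey = Tuple[int, int]  # (floor(lat), floor(lon)); an assumed indexing scheme


def tile_key(lat: float, lon: float) -> TileKey:
    """Map a coordinate to the key of the 1-degree cell that contains it."""
    return (math.floor(lat), math.floor(lon))


def elevation_at(lat: float, lon: float, tiles: Dict[TileKey, float]) -> float:
    """Return the stored height for the tile, or 0.0 when no tile exists.

    The 0.0 fallback mirrors the description's note that areas without
    tiles (oceans) can be assumed to have zero height.
    """
    return tiles.get(tile_key(lat, lon), 0.0)


# Toy data: one land tile with a made-up representative height in meters.
tiles = {(46, 7): 1500.0}
print(elevation_at(46.5, 7.2, tiles))    # 1500.0 (tile present)
print(elevation_at(0.5, -30.5, tiles))   # 0.0 (no tile, treated as ocean)
```

Note that for GLO-30 Public a missing tile may also mean the tile has not been released, so a real pipeline would need to distinguish that case from open ocean.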

CoversBR

copyright monitoring, cover song identification, live song identification, music, music features dataset, music information retrieval, music recognition

CoversBR is the first large audio database with, predominantly, Brazilian music for the tasks of Cover Song Identification (CSI) and Live Song Identification (LSI). Due to copyright restrictions, audio of the songs cannot be made available; however, metadata and feature files have public access. Audio streams captured from radio and TV channels for the live song identification task will be made public. CoversBR is composed of metadata and features extracted from 102298 songs, distributed in 26366 groups of covers/versions, with an average of 3.88 versions per group. The entire collecti...

Details →

Usage examples

See 2 usage examples →

DigitalCorpora

computer forensicscomputer securityCSIcyber securitydigital forensicsimage processingimaginginformation retrievalinternetintrusion detectionmachine learningmachine translationtext analysis

Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at

Details →

Usage examples

See 2 usage examples →

Downscaled Climate Data for Alaska (v1.1, August 2023)

agricultureclimatecoastalearth observationenvironmentalsustainabilityweather

This dataset contains historical and projected dynamically downscaled climate data for the State of Alaska and surrounding regions at 20 km spatial resolution and hourly temporal resolution. Select variables are also summarized at daily resolution. This data was produced using the Weather Research and Forecasting (WRF) model (Version 3.5). We downscaled ERA-Interim historical reanalysis data (1979-2015) as well as historical and projected runs from two GCMs from the Coupled Model Intercomparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1970-2005 and RCP 8.5: 2006-2100)....

Details →

Usage examples

See 2 usage examples →

EMory BrEast Imaging Dataset (EMBED)

biasbiologycancerhealthimaginglife sciencesmammographyx-ray

EMBED is a racially diverse mammography dataset containing 3.4M screening and diagnostic images from 110,000 patients collected from 2013-2020, with an equal representation of Black and White women. The dataset comprises 2D, synthetic 2D (C-view), and 3D (digital breast tomosynthesis, i.e. DBT) images. It contains 60,000 annotated lesions linked to structured imaging descriptors and ground truth pathologic outcomes grouped into six severity classes. This release represents 20% of the total 2D and C-view dataset and is available for research use. DBT, US, and MRI exams will be added at a ...

Details →

Usage examples

See 2 usage examples →

Emory Knee Radiograph (MRKR) dataset

bioinformaticsbiologycomputer visioncsvhealthimaginglabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray

The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of whom are African American. This dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type, and presence of hardware, enhancing its value for research and model development. MRKR addresses significant gaps ...

Details →

Usage examples

See 2 usage examples →

Genomic Characterization of Metastatic Castration Resistant Prostate Cancer

cancergenomiclife sciencesSTRIDESwhole genome sequencing

Biopsies of castration resistant prostate cancer metastases were subjected to whole genome sequencing (WGS), along with RNA-sequencing (RNA-Seq). The overarching goal of the study is to illuminate molecular mechanisms of acquired resistance to therapeutic agents, and particularly androgen signaling inhibitors, in the treatment of metastatic castration resistant prostate cancer (mCRPC). This study is made available on AWS via the NIH STRIDES Initiative.

Details →

Usage examples

See 2 usage examples →

Integrative Analysis of Lung Adenocarcinoma in Environment and Genetics Lung cancer Etiology (Phase 2)

cancerepigenomicsgenomiclife sciencesSTRIDESwhole exome sequencingwhole genome sequencing

We performed whole genome sequencing and whole exome sequencing of 31 lung adenocarcinoma (LUAD) samples from the Environment And Genetics in Lung cancer Etiology (EAGLE) study. The EAGLE study is made available on AWS via the NIH STRIDES Initiative (https://aws.amazon.com/blogs/publicsector/aws-and-national-institutes-of-health-collaborate-to-accelerate-discoveries-with-strides-initiative/).

Details →

Usage examples

See 2 usage examples →

LEarning biOchemical Prostate cAncer Recurrence from histopathology sliDes challenge (LEOPARD) Dataset

cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences

This dataset contains all the data for the LEarning biOchemical Prostate cAncer Recurrence from histopathology sliDes challenge, or LEOPARD. Prostate cancer, impacting 1.4 million men annually, is a prevalent malignancy (H. Sung et al., 2021). A substantial number of these individuals undergo prostatectomy as the primary curative treatment. The efficacy of this surgery is assessed, in part, by monitoring the concentration of prostate-specific antigen (PSA) in the bloodstream. While the role of PSA in prostate cancer screening is debatable (W. F. Clark et al., 2018; E. A. M. Heijnsdijk et al., 2018), it serves as a valuable biomarker for postprostatectomy follow-up in patients. Following successful surgery, PSA concentration is typically undetectable (<0.1 ng/mL) within 4-6 weeks (S. S. Goonewardene et al., 2014). However, approximately 30% of patients experience biochemical recurrence, signifying the resurgence of prostate cancer cells. This recurrence serves as a prognostic indicator for progression to clinical metastases and eventual prostate cancer-related mortality (C. L. Amling, 2014; S. J. Freedland et al., 2005; M. Han et al., 2001; T. Van den Broeck et al., 2001). Current clinical practices gauge the risk of biochemical recurrence by considering the International Society of Urological Pathology (ISUP) grade, PSA value at diagnosis, and TNM staging criteria (J. I. Epstein et al., 2016). A recent European consensus guideline suggests categorizing patients into low-risk, intermediate-risk, and high-risk groups based on these factors (N. Mottet et al., 2021). Notably, a high ISUP grade independently assigns a patient to the intermediate (grade 2/3) or high-risk group (grade 4/5). The Gleason growth patterns, representing morphological patterns of prostate cancer, are used to categorize cancerous tissue into ISUP grade groups (J. I. Epstein, 2010; P. M. Pierorazio et al., 2013; G. J. L. H. van Leenders et al., 2020; J. I. Epstein et al., 2016).
However, the ISUP grade has limitations, such as grading disagreement among pathologists (J. I. Epstein et al., 2016) and coarse descriptors of tissue morphology. Recently, deep learning was shown (H. Pinckaers et al., 2022; O. Eminaga et al., 2024)...

Details →

Usage examples

See 2 usage examples →

National Cancer Institute Center for Cancer Research - Diffuse Large B Cell Lymphoma (DLBCL) Genomics and Expression

cancergenomiclife sciences

The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas (DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification data.

Details →

Usage examples

See 2 usage examples →

Oregon Health & Science University Chronic Neutrophilic Leukemia Dataset

cancergenomiclife sciences

The OHSU-CNL study offers the whole exome and RNA-sequencing on a cohort of 100 cases with rare hematologic malignancies such as Chronic neutrophilic leukemia (CNL), atypical chronic myeloid leukemia (aCML), and unclassified myelodysplastic syndrome/myeloproliferative neoplasms (MDS/MPN-U). This dataset contains open RNA-Seq Gene Expression Quantification data.

Details →

Usage examples

See 2 usage examples →

Pancreatic Cancer Organoid Profiling

cancergeneticgenomiclife sciencesSTRIDEStranscriptomicswhole genome sequencing

This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS) and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.

Details →

Usage examples

See 2 usage examples →

RAPID NRT Flood Maps

agriculturedisaster responseearth observationenvironmentalwater

Near-real-time and archival high-resolution (10 m) flood inundation data over the contiguous United States, developed from the Sentinel-1 SAR imagery archive (2016-current) using an automated Radar Produced Inundation Diary (RAPID) algorithm.

Details →

Usage examples

See 2 usage examples →

Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan

disaster responseearth observationenvironmentalgeospatialsatellite imagerysynthetic aperture radar

The S1 Single Look Complex (SLC) dataset contains Synthetic Aperture Radar (SAR) data in the C-Band wavelength. The SAR sensors are installed on a two-satellite (Sentinel-1A and Sentinel-1B) constellation orbiting the Earth with a combined revisit time of six days, operated by the European Space Agency. The S1 SLC data are a Level-1 product that collects radar amplitude and phase information in all-weather, day or night conditions, which is ideal for studying natural hazards and emergency response, land applications, oil spill monitoring, sea-ice conditions, and associated climate change effec...

Details →

Usage examples

See 2 usage examples →

Sounds of Central African landscapes

biodiversitybiologyecosystemsgeospatiallandlife sciencesnatural resourcesurvey

Archival soundscapes recorded in the rainforest landscapes of Central Africa, with a focus on the vocalizations of African forest elephants (Loxodonta cyclotis).

Details →

Usage examples

See 2 usage examples →

TIGER Training

cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences

This dataset contains the training data for the Tumor InfiltratinG lymphocytes in breast cancER or TIGER challenge. TIGER is the first challenge on fully automated assessment of tumor-infiltrating lymphocytes (TILs) in breast cancer histopathology slides. TILs are proving to be an important biomarker in cancer patients as they can play a part in killing tumor cells, particularly in some types of breast cancer. Identifying and measuring TILs can help to better target treatments, particularly immunotherapy, and may result in lower levels of other more aggressive treatments, including chemo...

Details →

Usage examples

See 2 usage examples →

Terra Fusion Data Sampler

geospatialsatellite imagery

The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances from the five Terra instruments. The files have been fully validated to contain the original Terra instrument Level 1 data. Each Level 1 Terra Basic Fusion file contains one full Terra orbit of data and is typically 15 – 40 GB in size, depending on how much data was collected for that orbit. It contains instrument radiance in physical units; radiance quality indicator; geolocation for each IFOV at its native resolution; sun-view geometry; observation time; and other attributes/metadata. It is stored in HDF5, conforms to CF conventions, and is accessible by netCDF-4 enhanced models. Its naming convention follows: TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5. A concise description of the dataset, along with links to complete documentation and available software tools, can be found on the Terra Fusion project page: https://terrafusion.web.illinois.edu.
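As a rough illustration of the naming convention above, the following Python sketch splits a file name into its parts. It assumes that the X and 0 placeholders in the template stand for decimal digits and that YYYYMMDDHHMMSS is a calendar timestamp; the function name, the dictionary keys, and the example file name are hypothetical and do not come from the Terra Fusion documentation, which should be consulted for the meaning of the F and V fields.

```python
import re
from datetime import datetime

# Pattern derived from the template TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5.
# Assumption: every placeholder character (X, 0) stands for a decimal digit.
TERRA_BF_PATTERN = re.compile(
    r"TERRA_BF_L1B_O(?P<orbit>\d+)_(?P<stamp>\d{14})_F(?P<f>\d+)_V(?P<v>\d+)\.h5"
)


def parse_terra_bf_name(filename):
    """Split a Terra Basic Fusion file name into its labeled fields."""
    match = TERRA_BF_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a Terra Basic Fusion file name: {filename!r}")
    return {
        "orbit": int(match.group("orbit")),
        # Parse the 14-digit field as year, month, day, hour, minute, second.
        "timestamp": datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S"),
        # The F and V fields are kept as raw strings; their meaning is not
        # described in the text above.
        "f_field": match.group("f"),
        "v_field": match.group("v"),
    }


# Hypothetical file name that follows the template.
info = parse_terra_bf_name("TERRA_BF_L1B_O1234_20010101123000_F000_V000.h5")
print(info["orbit"])                  # 1234
print(info["timestamp"].isoformat())  # 2001-01-01T12:30:00
```

Using `fullmatch` rejects names with extra prefixes or suffixes, which may be too strict if the files are addressed with a directory path; in that case the base name would need to be extracted first.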

Terra is the flagship satellite of NASA’s Earth Observing System (EOS). It was launched into orbit on December 18, 1999 and carries five instruments. These are the Moderate-resolution Imaging Spectroradiometer (MODIS), the Multi-angle Imaging SpectroRadiometer (MISR), the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), the Clouds and Earth’s Radiant Energy System (CERES), and the Measurements of Pollution in the Troposphere (MOPITT).

The Terra Basic Fusion dataset is an easy-to-access record of the Level 1 radiances for instruments on...

Details →

Usage examples

See 2 usage examples →

UniProt

bioinformaticsbiologychemistryenzymegraphlife sciencesmoleculeproteinRDFSPARQL

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.

Details →

Usage examples

See 2 usage examples →

3DCoMPaT: Composition of Materials on Parts of 3D Things

computer visionmachine learning

3D CoMPaT is a richly annotated large-scale dataset of rendered compositions of Materials on Parts of thousands of unique 3D Models. This dataset primarily focuses on stylizing 3D shapes at part-level with compatible materials. Each object with the applied part-material compositions is rendered from four equally spaced views as well as four randomized views. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. We present two variations of this task and adapt state-of-the-art 2D/3D deep learning met...

Details →

Usage examples

See 1 usage example →

CIViC (Clinical Interpretation of Variants in Cancer)

geneticgenomiclife sciencesvcf

Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...

Details →

Usage examples

See 1 usage example →

Defense Meteorology Satellite Program (DMSP) Auroral Particle Flux

earth observationgeospatialsolarspace weather

The United States Air Force (USAF) Defense Meteorological Satellite Program (DMSP) SSJ precipitating particle instrument measures in-situ total flux and energy distribution of electrons and ions at low earth orbit. These precipitating particles are of interest for space weather operations and research, in part because they produce aurora during normal and very strong geomagnetic storms. This dataset contains both sensor-level raw data (as detailed in Redmon et al. 2017) and a high-level machine-learning-ready data product.

Details →

Usage examples

See 1 usage example →

Ensemble Meteorological Dataset for Planet Earth, EM-Earth

atmospheremeteorologicalnear-surface air temperaturenetcdfprecipitation

EM-Earth provides data for precipitation, mean air temperature, air temperature range, and dew-point temperature at 0.1° spatial resolution over global land areas from 1950 to 2019. EM-Earth provides hourly/daily deterministic estimates, and daily probabilistic estimates (25 ensemble members), to meet the diverse requirements of hydrometeorological applications.

Details →

Usage examples

See 1 usage example →

Global Biodiversity Information Facility (GBIF) Species Occurrences

biodiversitybioinformaticsconservationearth observationlife sciences

The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world's governments providing global data that document the occurrence of species. GBIF currently integrates datasets documenting over 1.6 billion species occurrences, growing daily. The GBIF occurrence dataset combines data from a wide array of sources including specimen-related data from natural history museums, observations from citizen science networks and environment recording schemes. While these data are constantly changing at GBIF.org, periodic snapshots are taken a...

Details →

Usage examples

See 1 usage example →

Human Cancer Models Initiative (HCMI) Cancer Model Development Center

cancergenomiclife sciencesSTRIDESwhole genome sequencing

The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...

Details →

Usage examples

See 1 usage example →

LOFAR ELAIS-N1 cycle 2 observations on AWS

astronomyimagingsurvey

These data correspond to the International LOFAR Telescope observations of the sky field ELAIS-N1 (16:10:01 +54:30:36) during cycle 2 of observations. There are 11 runs of about 8 hours each plus the corresponding observation of the calibration targets before and after the target field. The data are measurement sets (MS) containing the cross-correlated data and metadata divided into 371 frequency sub-bands per target centred at ~150 MHz.

Details →

Usage examples

See 1 usage example →

OpenSurfaces

computer vision

A large database of annotated surfaces created from real-world consumer photographs.

Details →

Usage examples

See 1 usage example →

iHART Whole Genome Sequencing Data Set

autism spectrum disorderbamgeneticgenomiclife sciencesvcfwhole genome sequencing

iHART is the Hartwell Foundation’s Autism Research and Technology Initiative. This release contains whole genome data from over 1000 families with 2 or more children with autism, whose biomaterials were provided by the Autism Genetic Resource Exchange (AGRE).

Details →

Usage examples

See 1 usage example →

recount3

bioinformaticsbiologycancercsvgene expressiongeneticgenomicHomo sapienslife sciencesMus musculusneurosciencetranscriptomics

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.

Details →

Usage examples

See 1 usage example →

Australasian Genomes

biodiversitybiologyconservationgeneticgenomiclife sciencestranscriptomicswildlife

Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI) and the ARC Centre for Innovations in Peptide and Protein Science (CIPPS). This repository contains reference genomes, transcriptomes, resequenced genomes and reduced representation sequencing data from Australasian species. Australasian Genomes is managed by the Australasian Wildlife Genomics Group (AWGG) at the University of Sydney on behalf of our collaborators within TSI and CIPPS.

Details →

CAFE60 reanalysis

climatesustainability

The CSIRO Climate retrospective Analysis and Forecast Ensemble system: version 1 (CAFE60v1) provides a large ensemble retrospective analysis of the global climate system from 1960 to present with sufficiently many realizations and at spatio-temporal resolutions suitable to enable probabilistic climate studies. Using a variant of the ensemble Kalman filter, 96 climate state estimates are generated over the most recent six decades. These state estimates are constrained by monthly mean ocean, atmosphere and sea ice observations such that their trajectories track the observed state while enabling ...

Details →

COVID-19 Molecular Structure and Therapeutics Hub

bioinformaticsbiologycoronavirusCOVID-19life sciencesmolecular dockingpharmaceutical

Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community. A community-driven data repository and curation service for molecular structures, models, therapeutics, and simulations supporting computational research into therapeutic opportunities for COVID-19 (caused by the SARS-CoV-2 coronavirus).

Details →

CRC-SAS/SISSA historical seasonal and subseasonal forecast database

agricultureearth observationforecasthydrologymeteorologicalnatural resourceweather

Within the framework of the Southern South American Drought Information System (SISSA), a base of sub-seasonal and seasonal scale predictions has been developed with corrected and uncorrected data, with the purpose of studying predictability at different scales and also to be used to feed models for sectors such as agriculture and hydrology.

The database contains daily scale data between 2000-2019 (uncorrected) and 2010-2019 (corrected) for several variables including: mean, maximum and minimum temperature, as well as rainfall, mean wind and other variables intended to feed hydrological and crop models.

The database covers the entire area of the Regional Climate Center for Southern South America (CRC-SAS), from Bolivia and south-central Brazil to Patagonia, including member countries such as Chile, Argentina, Brazil, Paraguay, Uruguay and Bolivia.

The base was generated from GEFSv12 data for subseasonal scale (GEFS) and CFS2 for seasonal scale (CFS2). Data from the ERA5 reanalysis (ERA5) were used to generate the corrected data.

Details →

Community Multiscale Air Quality (CMAQ) 2019 3D Gridded and Column data from the EPA's Air Quality Time Series (EQUATES) Project

air qualityatmospheremodel

The data are part of EPA’s Air Quality Time Series (EQUATES) Project. The data consist of hourly gridded pollutant concentration estimates by the Community Multiscale Air Quality (CMAQ) model version 5.3.2 (https://doi.org/10.15139/S3/F2KJSK) for January 1 – December 31, 2019. Model data is provided for two spatial domains: the Northern Hemisphere (108 km x 108 km horizontal grid spacing) and the Contiguous United States including parts of Canada and Mexico (12 km x 12 km horizontal grid spacing). Two types of hourly data are provided: three-dimensional air pollutant concentrations and vert...

Details →

EPA Dynamically Downscaled Ensemble (EDDE) Version 1

agricultureair qualityair temperatureatmosphereclimateclimate modelclimate projectionsCMIP5CMIP6ecosystemselevationenvironmentalEulerianeventsfloodsfluid dynamicsgeosciencegeospatialhdf5healthHPChydrologyinfrastructureland coverland usemeteorologicalmodelnear-surface air temperaturenear-surface relative humiditynear-surface specific humiditynetcdfopen source softwarephysicspost-processingprecipitationradiationsimulationsuswaterweather

The data are a subset of the EPA Dynamically Downscaled Ensemble (EDDE), Version 1. EDDE is a collection of physics-based modeled data that represent 3D atmospheric conditions for historical and future periods under different scenarios. The EDDE Version 1 datasets cover the contiguous United States at a horizontal grid spacing of 36 kilometers at hourly increments. EDDE Version 1 includes simulations that have been dynamically downscaled from multiple global climate models (GCMs) under both mid- and high-emission scenarios from the Fifth Coupled Model Intercomparison Project (CMIP5) using the...

Details →

EPA Dynamically Downscaled Ensemble (EDDE) Version 2

agricultureair qualityair temperatureatmosphereclimateclimate modelclimate projectionsCMIP5CMIP6ecosystemselevationenvironmentalEulerianeventsfloodsfluid dynamicsgeosciencegeospatialhdf5healthHPChydrologyinfrastructureland coverland usemeteorologicalmodelnear-surface air temperaturenear-surface relative humiditynear-surface specific humiditynetcdfopen source softwarephysicspost-processingprecipitationradiationsimulationsuswaterweather

The data are a subset of the EPA Dynamically Downscaled Ensemble (EDDE), Version 2. EDDE is a collection of physics-based modeled data that represent 3D atmospheric conditions for historical and future periods under different scenarios. The EDDE Version 2 datasets cover the contiguous United States at a horizontal grid spacing of 12 kilometers at hourly increments. EDDE Version 2 will include simulations that have been dynamically downscaled from multiple global climate models (GCMs) under multiple emission scenarios from the Sixth Coupled Model Intercomparison Project (CMIP6) using the Weath...

Details →

Epoch of Reionization Dataset

astronomy

The data are from observations with the Murchison Widefield Array (MWA), which is a Square Kilometer Array (SKA) precursor in Western Australia. This particular dataset is from the Epoch of Reionization project, which is a key science driver of the SKA. Nearly 2 PB of such observations have been recorded to date; this is a small subset that has been exported from the MWA data archive in Perth and made available to the public on AWS. The data were taken to detect signatures of the first stars and galaxies forming and the effect of these early stars and galaxies on the evolution of the u...

Details →

High Resolution Downscaled Climate Data for Southeast Alaska

agricultureclimatecoastalearth observationenvironmentalsustainabilityweather

This dataset contains historical and projected dynamically downscaled climate data for the Southeast region of the State of Alaska at 1 km and 4 km spatial resolution and hourly temporal resolution. Select variables are also summarized at daily resolution. This data was produced using the Weather Research and Forecasting (WRF) model (Version 4.0). We downscaled Climate Forecast System Reanalysis (CFSR) historical reanalysis data (1980-2019) as well as historical and projected runs from two GCMs from the Coupled Model Intercomparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical ru...

Details →

InRad COVID-19 X-Ray and CT Scans

bioinformaticscoronavirusCOVID-19healthlife sciencesmedicineSARS

This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV-2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury.

Details →

MIMIC-IV Clinical Database Demo

The Medical Information Mart for Intensive Care (MIMIC)-IV database comprises deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly-available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether MIMIC-IV is appropriate for a study before making an access r...

Details →

MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset

The MIMIC-IV-ECG module contains approximately 800,000 diagnostic electrocardiograms across nearly 160,000 unique patients. These diagnostic ECGs use 12 leads and are 10 seconds in length. They are sampled at 500 Hz. This subset contains all of the ECGs for patients who appear in the MIMIC-IV Clinical Database. When a cardiologist report is available for a given ECG, we provide the needed information to link the waveform to the report. The patients in MIMIC-IV-ECG have been matched against the MIMIC-IV Clinical Database, making it possible to link to information across the MIMIC-IV modules.
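As a quick sizing aid, the figures quoted above (12 leads, 10 seconds, 500 Hz) can be multiplied out to get the number of samples in one recording. This is only a back-of-the-envelope sketch in Python: the constant names are hypothetical, and it assumes every record has exactly the nominal length and lead count.

```python
# Nominal recording parameters quoted in the description above.
SAMPLING_RATE_HZ = 500
DURATION_S = 10
NUM_LEADS = 12

# Samples recorded on a single lead over the full recording.
samples_per_lead = SAMPLING_RATE_HZ * DURATION_S

# Samples across all leads of one ECG record.
samples_per_record = samples_per_lead * NUM_LEADS

print(samples_per_lead)    # 5000
print(samples_per_record)  # 60000
```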

Details →

MegaScenes

benchmarkcomputer visiondeep learninginternet

The MegaScenes Dataset is an extensive collection of around 430k scenes, featuring over 100k structure-from-motion reconstructions and over 2 million registered images. MegaScenes includes a diverse array of scenes, such as minarets, building interiors, statues, bridges, towers, religious buildings, and natural landscapes. The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.

Details →

Usage examples

See 3 usage examples →

OpenNeuro

biologyimaginglife sciencesneurobiologyneuroimaging

OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by th...

Details →

Opioid Industry Documents Archive (OIDA) Data on AWS

archiveslife sciencespharmaceuticaltext analysistxt

The OIDA Data on AWS contain the metadata, documents, and extracted text for all of the documents in the UCSF-JHU Opioid Industry Documents Archive, a growing corpus of internal corporate records and other documents arising from the opioid industry.

Details →

Physionet

biologylife sciences

PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).

Details →

Smithsonian Open Access

artcultureencyclopedichistorymuseum

The Smithsonian’s mission is the "increase and diffusion of knowledge," and the institution has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and, additionally, research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects,...

Details →

SocialGene RefSeq Databases

amino acid, bioinformatics, chemical biology, genomic, graph, metagenomics, microbiome, pharmaceutical, protein

Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.

Details →

Usage examples

See 3 usage examples →

The Genome Modeling System

genetic, genomic, life sciences

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

Details →

UCSC Genome Browser Sequence and Annotations

bioinformatics, biology, genetic, genomic, life sciences

The UCSC Genome Browser is an online graphical viewer for genomes, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-separated format and all binary files in custom formats, sometimes referred to as 'gbdb' files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome...

Details →

University of British Columbia Sunflower Genome Dataset

agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing

This dataset captures the sunflower's genetic diversity, drawn from thousands of wild, cultivated, and landrace sunflower individuals distributed across North America. The data consist of raw sequences and associated botanical metadata, aligned sequences (to three different reference genomes), and sets of SNPs computed across several cohorts.

Details →

Will Two Do? Varying Dimensions in Electrocardiography: The PhysioNet/Computing in Cardiology Challenge 2021

The electrocardiogram (ECG) is a non-invasive representation of the electrical activity of the heart. Although the twelve-lead ECG is the standard diagnostic screening system for many cardiological issues, the limited accessibility of twelve-lead ECG devices provides a rationale for smaller, lower-cost, and easier-to-use devices. While single-lead ECGs are limiting [1], reduced-lead ECG systems hold promise, with evidence that subsets of the standard twelve leads can capture useful information [2], [3], [4] and even be comparable to twelve-lead ECGs in some limited contexts. In 2017 we challen...

Details →

SatPM2.5

air quality, atmosphere, environmental, health, netcdf

Fine particulate matter (PM2.5) concentrations are estimated using information from satellite-, simulation- and monitor-based sources. Aerosol optical depth from multiple satellites (MODIS, VIIRS, MISR, and SeaWiFS) and their respective retrievals (Dark Target, Deep Blue, MAIAC) is combined with simulation (GEOS-Chem) based upon their relative uncertainties as determined using ground-based sun photometer (AERONET) observations to produce geophysical estimates that explain most of the variance in ground-based PM2.5 measurements. A subsequent statistical fusion incorporates additional inf...

Details →

Usage examples

See 2 usage examples →

Global Carbon Budget Data

climate, land, oceans

The Global Carbon Budget (GCB) is recognised globally as the most comprehensive report on global carbon emissions and sinks. This dataset, updated every year, includes estimates of land and ocean carbon fluxes from the suite of models used in the report.

Details →

Usage examples

  • Global Carbon Budget 2023 by Pierre Friedlingstein, Michael O’Sullivan, Matthew W. Jones, Robbie M. Andrew, Luke Gregor, Judith Hauck, Corinne Le Quéré, Ingrid T. Luijkx, Are Olsen, Glen P. Peters, Wouter Peters, Julia Pongratz, Clemens Schwingshackl, Stephen Sitch, Josep G. Canadell, Philippe Ciais, Rob B. Jackson, Simone Alin, Ramdane Alkama, Almut Arneth, Vivek K. Arora, Nicholas R. Bates, Meike Becker, Nicolas Bellouin, Henry C. Bittig, Laurent Bopp, Frédéric Chevallier, Louise P. Chini, Margot Cronin, Wiley Evans, Stefanie Falk, Richard A. Feely, Thomas Gasser, Marion Gehlen, Thanos Gkritzalis, Lucas Gloege, Giacomo Grassi, Nicolas Gruber, Özgür Gürses, Ian Harris, Matthew Hefner, Richard A. Houghton, George C. Hurtt, Yosuke Iida, Tatiana Ilyina, Atul K. Jain, Annika Jersild, Koji Kadono, Etsushi Kato, Daniel Kennedy, Kees Klein Goldewijk, Jürgen Knauer, Jan Ivar Korsbakken, Peter Landschützer, Nathalie Lefèvre, Keith Lindsay, Junjie Liu, Zhu Liu, Gregg Marland, Nicolas Mayot, Matthew J. McGrath, Nicolas Metzl, Natalie M. Monacci, David R. Munro, Shin-Ichiro Nakaoka, Yosuke Niwa, Kevin O’Brien, Tsuneo Ono, Paul I. Palmer, Naiqing Pan, Denis Pierrot, Katie Pocock, Benjamin Poulter, Laure Resplandy, Eddy Robertson, Christian Rödenbeck, Carmen Rodriguez, Thais M. Rosan, Jörg Schwinger, Roland Séférian, Jamie D. Shutler, Ingunn Skjelvan, Tobias Steinhoff, Qing Sun, Adrienne J. Sutton, Colm Sweeney, Shintaro Takao, Toste Tanhua, Pieter P. Tans, Xiangjun Tian, Hanqin Tian, Bronte Tilbrook, Hiroyuki Tsujino, Francesco Tubiello, Guido R. van der Werf, Anthony P. Walker, Rik Wanninkhof, Chris Whitehead, Anna Wranne, Rebecca Wright, Wenping Yuan, Chao Yue, Xu Yue, Sönke Zaehle, Jiye Zeng, Bo Zheng

See 1 usage example →

mirrulations

government records

The regulations.gov website allows users to view proposed rules and supporting documents for the federal rule-making process. In addition, users can post and view comments about those proposed rules. The site contains about 27 million pieces of text and binary data, but the API that provides access only allows a user to obtain one thousand items per hour. As a result, it would take approximately 3 years to download all the data. Mirrulations (MIRRor of regULATIONS.gov) is a system that uses a collection of donated API keys to create a mirror of the data. In addition, for each pdf in the da...

Details →

Usage examples

See 1 usage example →
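
The download-time estimate in the mirrulations description above can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted there (about 27 million items, one thousand items per hour); the `years_to_download` helper, its `keys` parameter, and the assumption that several API keys add their rates linearly are illustrative and are not taken from the Mirrulations project.

```python
# Figures quoted in the mirrulations description.
TOTAL_ITEMS = 27_000_000   # "about 27 million pieces of text and binary data"
ITEMS_PER_HOUR = 1_000     # per-key API limit: one thousand items per hour

HOURS_PER_YEAR = 24 * 365  # assumption: continuous downloading, leap days ignored


def years_to_download(total_items: int, items_per_hour: int, keys: int = 1) -> float:
    """Return the years needed to fetch every item.

    `keys` is a hypothetical parameter: it assumes each extra API key
    contributes the same hourly rate and that the rates simply add up.
    """
    hours = total_items / (items_per_hour * keys)
    return hours / HOURS_PER_YEAR


single_key_years = years_to_download(TOTAL_ITEMS, ITEMS_PER_HOUR)
print(f"One key: about {single_key_years:.1f} years")
```

With a single key this works out to 27,000 hours, which is consistent with the "approximately 3 years" stated in the description. Passing a larger `keys` value shows how a pool of donated keys would shorten the mirror time under the linear-scaling assumption.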