Registry of Open Data on AWS

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with bioinformatics.

You are currently viewing a subset of data tagged with bioinformatics.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

The Human Sleep Project

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from ~15K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting. This data is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG...

Usage examples

Algorithm for automatic detection of self-similarity and prediction of residual central respiratory events during continuous positive airway pressure. Sleep. 2021 Apr 9;44(4):zsaa215. doi: 10.1093/sleep/zsaa215. PMCID: PMC8631077. by Oppersma E, Ganglberger W, Sun H, Thomas RJ*, Westover MB*
Classification algorithms for predicting sleepiness and sleep apnea severity. Journal of Sleep Research. 2012 Feb;21(1):101-12. PMCID: PMC3698244. by Eiseman NA, Westover MB, Mietus JE, Thomas RJ, Bianchi MT
Optimal Spindle Detection Parameters for Predicting Cognitive Performance. Sleep. 2022 Jan 4:zsac001. doi: 10.1093/sleep/zsac001. Epub ahead of print. PMCID: PMC8996023. by Adra N, Sun H, Ganglberger W, Ye EM, Dümmer LW, Tesh RA, et al.
Age estimation from sleep studies using deep learning predicts life expectancy. NPJ Digit Med. 2022 Jul 22;5(1):103. doi: 10.1038/s41746-022-00630-9. PMCID: PMC9307657. by Brink-Kjaer A, Leary EB, Sun H, Westover MB, Stone KL, Peppard PE, et al.
The Challenge of Undiagnosed Sleep Apnea in Low-Risk Populations: A Decision Analysis. Military Medicine 2014 Aug;179(8S):47-54. PMCID: PMC6788752. by Bianchi MT, Hershman S, Bahadoran M, Ferguson M, Westover MB.

See 37 usage examples →

CELLxGENE Discover Census

Biohub, bioinformatics, cell biology, life sciences, single-cell transcriptomics, transcriptomics

CELLxGENE Discover (cellxgene.cziscience.com) is a free-to-use platform for the exploration, analysis, and retrieval of single-cell data. CELLxGENE Discover hosts the largest aggregation of standardized single-cell data from the major human and mouse tissues, with modalities that include gene expression, chromatin accessibility, DNA methylation, and spatial transcriptomics. This year, CELLxGENE Discover has made available all of its human and mouse RNA single-cell data through Census (https://chanzuckerberg.github.io/cellxgene-census/) – a free-to-use service with an API and data that allows f...

Usage examples

See 19 usage examples →

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, 4.2, and 4.4

bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing

Overview

This dataset contains alignment files and small variant (single nucleotide variant (SNV) and indel), copy number variant (CNV), short tandem repeat (i.e., repeat expansion; STR), structural variant (SV), and other variant call files from the 1000 Genomes Project (1KGP) Phase 3 dataset (3,202 individuals, 602 trios), generated using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, v4.2.7, and v4.4.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more infor...

Usage examples

DRAGEN Support Resources by Illumina Inc.
DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes by Illumina Inc. (2020)
Accurate and efficient calling of small and large variants from popgen datasets using the DRAGEN Bio-IT Platform by Illumina Inc. (2021)
Overcoming high homology to detect variation in CYP21A2 with whole-genome sequencing in DRAGEN by Illumina Inc. (2023)
Illumina Connected Analytics User Guide by Illumina Inc.

See 17 usage examples →

Cell Painting Gallery

bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle

The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained ...

Usage examples

Deep Profiler - Morphological profiling using deep learning by Multiple Authors
Accelerating Drug Discovery with high-throughput Cell Painting on AWS by Chris Kaspar
Image-based Profiling Handbook - for processing image-based profiling datasets using CellProfiler and pycytominer by Multiple Authors
Multiplex Cytological Profiling Assay to Measure Diverse Cellular States by Gustafsdottir SM, Ljosa V, Sokolnicki KL, Wilson JA, Walpita D, Kemp MM, Seiler KP, Carrel HA, Golub TR, Schreiber SL, Clemons PA, Carpenter AE, and Shamji AF
Image-based Profiling Recipe by Multiple Authors

See 17 usage examples →

Genome Aggregation Database (gnomAD)

bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v4.1 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries, sequenced as part of various disease-specific and population genetic studies. The gnomAD Principal Investigators and team can be found ...

Usage examples

gnomAD quality control GitHub repository by gnomAD Production Team
Characterising the loss-of-function impact of 5’ untranslated region variants in 15,708 individuals. Nature Communications 11, 2523 (2020) by Whiffin, N., Karczewski, K. J., Zhang, X., Chothani, S., Smith, M. J., Gareth Evans, D., Roberts, A. M., Quaife, N. M., Schafer, S., Rackham, O., Alföldi, J., O’Donnell-Luria, A. H., Francioli, L. C., Genome Aggregation Database (gnomAD) Production Team, Genome Aggregation Database (gnomAD) Consortium, Cook, S. A., Barton, P. J. R., MacArthur, D. G., & Ware, J. S.
Evaluating potential drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (2020) by Minikel, E. V., Karczewski, K. J., Martin, H. C., Cummings, B. B., Whiffin, N., Rhodes, D., Alföldi, J., Trembath, R. C., van Heel, D. A., Daly, M. J., Genome Aggregation Database Production Team, Genome Aggregation Database Consortium, Schreiber, S. L., & MacArthur, D. G.
Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nature Communications 11, 2539 (2020) by Wang, Q., Pierce-Hoffman, E., Cummings, B. B., Karczewski, K. J., Alföldi, J., Francioli, L. C., Gauthier, L. D., Hill, A. J., O’Donnell-Luria, A. H., Genome Aggregation Database (gnomAD) Production Team, Genome Aggregation Database (gnomAD) Consortium, & MacArthur, D. G.
A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024) by Chen, S., Francioli, L. C., Goodrich, J. K., Collins, R. L., Wang, Q., Alföldi, J., Watts, N. A., Vittal, C., Gauthier, L. D., Poterba, T., Wilson, M. W., Tarasova, Y., Phu, W., Yohannes, M. T., Koenig, Z., Farjoun, Y., Banks, E., Donnelly, S., Gabriel, S., Gupta, N., Ferriera, S., Tolonen, C., Novod, S., Bergelson, L., Roazen, D., Ruano-Rubio, V., Covarrubias, M., Llanwarne, C., Petrillo, N., Wade, G., Jeandet, T., Munshi, R., Tibbetts, K., gnomAD Project Consortium, O’Donnell-Luria, A., Solomonson, M., Seed, C., Martin, A. R., Talkowski, M. E., Rehm, H. L., Daly, M. J., Tiao, G., Neale, B. M., MacArthur, D. G. & Karczewski, K. J.

See 15 usage examples →

The Singapore Nanopore Expression Data Set

bam, bioinformatics, fast5, fasta, fastq, genomic, life sciences, long read sequencing, short read sequencing, transcriptomics

The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...

Usage examples

Detection of m6A from direct RNA sequencing using a Multiple Instance Learning framework. by Christopher Hendra et al.
JAFFAL: Detecting fusion genes with long read transcriptome sequencing. by Nadia M Davidson et al.
Accessing the SG-NEx dataset on AWS by Ying Chen
nf-core/nanoseq: A nanopore DNA and RNA-Seq demultiplexing, QC, alignment and analysis pipeline by Chelsea Sawyer et al.
Performing transcript discovery and quantification with Bambu by Min Hao Ling

See 15 usage examples →

Open Targets

bioinformatics, biology, drug discovery, genetic, genomic, life sciences, protein

The Open Targets Platform is a comprehensive data integration tool that supports systematic identification and prioritisation of potential therapeutic drug targets. By integrating publicly available datasets including data generated by the Open Targets experimental and informatics research programmes, the Platform provides data and services to assist in the task of therapeutic hypothesis building.

Usage examples

See 11 usage examples →

The Cancer Dependency Map (DepMap) Cancer Cell Line Encyclopedia (CCLE) Dataset

bam, bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, short read sequencing, transcriptomics, whole exome sequencing, whole genome sequencing

This dataset consists of whole genome sequencing (WGS), whole exome sequencing (WES), and RNA sequencing files generated from ~1000 cancer cell lines described in Ghandi et al., 2019.

Usage examples

The present and future of the Cancer Dependency Map by Arafeh, Shibue, Dempster et al.
Integrated cross-study datasets of genetic dependencies in cancer by Pacini, Dempster, Boyle et al.
Machine learning multi-omics analysis reveals cancer driver dysregulation in pan-cancer cell lines compared to primary tumors by Sanders, Chandra, Zebarjadi et al.
The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks by Ben Guebila, Wang, Lopes-Ramos et al.
Genetic dependencies associated with transcription factor activities in human cancer cell lines by Thatikonda, Supper, Wachter et al.

See 11 usage examples →

Alliance of Genome Resources

bioinformatics, biology, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, fasta, gene expression, genetic, genome, genomic, Homo sapiens, life sciences, Mus musculus, protein, Rattus norvegicus, transcriptomics, vcf

The Alliance of Genome Resources is a consortium that integrates genomic, genetic, and molecular data from leading model organism databases including Drosophila melanogaster, Caenorhabditis elegans, Danio rerio (zebrafish), Mus musculus (mouse), Rattus norvegicus (rat), Saccharomyces cerevisiae (yeast), Xenopus laevis and Xenopus tropicalis (frogs), and human reference data. The Alliance provides comprehensive datasets including gene annotations, disease associations, expression data (bulk and single-cell RNA-Seq), protein and genetic interactions, orthology relationships, variants and alleles...

Usage examples

RGD - Rat Genome Database by RGD
Alliance of Genome Resources Portal - unified model organism research platform by Alliance of Genome Resources Consortium
SGD - Saccharomyces Genome Database by SGD
FlyBase - Drosophila Database by FlyBase Consortium
Xenbase - Xenopus Database by Xenbase

See 10 usage examples →

Garvan Institute Long Read Sequencing Benchmark Data

bioinformatics, genomic, life sciences, long read sequencing

The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Examples include: NA12878 (HG001) and NA24385 (HG002) sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells; and, UHR RNA (direct-RNA) on an ONT PromethION using the latest RNA004 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for ...

Usage examples

Slow5lib: toolkit slow5lib is a software library for reading & writing SLOW5 files. by Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al.
buttery-eel: The buttery eel - a slow5 guppy basecaller wrapper by Samarakoon, H., Ferguson, J.M., Gamaarachchi H. et al.
Streamlining remote nanopore data access with slow5curl by Wong, B., Ferguson, J.M., Do, J.Y. et al.
Fast nanopore sequencing data analysis with SLOW5. by Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al.
Flexible and efficient handling of nanopore sequencing signal data with slow5tools. by Samarakoon, H., Ferguson, J.M., Jenner, S.P. et al.

See 10 usage examples →

PubSeq - Public Sequence Resource

bam, bioinformatics, biology, coronavirus, COVID-19, fast5, fasta, fastq, genetic, genomic, health, json, life sciences, long read sequencing, medicine, MERS, metadata, open source software, RDF, SARS, SARS-CoV-2, SPARQL

COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.

Usage examples

See 9 usage examples →

Steinegger Lab Datasets

bioinformatics, life sciences, metagenomics, open source software, protein, protein folding

The Steinegger Lab Dataset comprises biological databases and resources critical for protein sequence and structure analysis, developed to support ColabFold, MMseqs2, and Foldseek/Foldcomp, three high-performance computational tools widely used in bioinformatics. The MMseqs2 dataset serves as the backbone for our fast structure prediction tool, ColabFold, and includes UniRef30, BFD, and the ColabFold environmental databases. These datasets are specifically designed for the rapid generation of multiple sequence alignments (MSAs), which are essential for high-accuracy structure prediction. Beyond ...

Usage examples

Fast and accurate protein structure search with Foldseek by van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al.
ColabFold Google Colab Notebook by Ovchinnikov S, Mirdita M and Steinegger M
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets by Steinegger M and Söding J
Foldseek User Guide by Mirdita M and Steinegger M
Run ColabFold on your local computer by Moriwaki Y

See 9 usage examples →

NIH Roadmap Epigenomics

bioinformatics, biology, epigenomics, genetic, genomic, life sciences

The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The project has generated high-quality, genome-wide maps of several key histone modifications, chromatin accessibility, DNA methylation and mRNA expression across 100s of human cell types and tissues. To see what data is available, please check the directory listing: https://roadmapepigenomics.s3.us-west-2.amazonaws.com/index.html.
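The directory listing above can also be read programmatically. The following is a minimal sketch, assuming the bucket permits anonymous listing through the standard S3 REST ListObjectsV2 request; the bucket host is taken from the URL in the description, while the function names (build_list_url, parse_listing, list_objects) and the default arguments are hypothetical choices made for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket host taken from the directory-listing URL in the description above.
BUCKET_URL = "https://roadmapepigenomics.s3.us-west-2.amazonaws.com/"
# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(prefix="", max_keys=100):
    """Build an S3 ListObjectsV2 request URL for the bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_objects(prefix="", max_keys=100):
    """Fetch one page of the listing over HTTPS.

    Assumption: anonymous listing is allowed on this bucket. Needs network access.
    """
    with urllib.request.urlopen(build_list_url(prefix, max_keys)) as resp:
        return parse_listing(resp.read())
```

Calling list_objects() returns only the first page of results; a fuller client would follow the continuation token that S3 includes in truncated responses.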

Usage examples

Human body epigenome maps reveal noncanonical DNA methylation variation by Matthew D. Schultz, Yupeng He, John W. Whitaker, Manoj Hariharan, Eran A. Mukamel, Danny Leung, Nisha Rajagopal, Joseph R. Nery, Mark A. Urich, Huaming Chen, Shin Lin, Yiing Lin, Inkyung Jung, Anthony D. Schmitt, Siddarth Selvaraj, Bing Ren, Terrence J. Sejnowski, Wei Wang & Joseph R. Ecker
Chromatin architecture reorganization during stem cell differentiation by Jesse R. Dixon, Inkyung Jung, Siddarth Selvaraj, Yin Shen, Jessica E. Antosiewicz-Bourget, Ah Young Lee, Zhen Ye, Audrey Kim, Nisha Rajagopal, Wei Xie, Yarui Diao, Jing Liang, Huimin Zhao, Victor V. Lobanenkov, Joseph R. Ecker, James A. Thomson & Bing Ren
Visualize Roadmap data with WashU Epigenome Browser by WashU Epigenome Browser
The epigenomic landscape of transposable elements across normal human development and anatomy by Erica C. Pehrsson, Mayank N. K. Choudhary, Vasavi Sundaram & Ting Wang
WashU Epigenome Browser update 2019 by Daofeng Li, Silas Hsu, Deepak Purushotham, Renee L Sears and Ting Wang

See 8 usage examples →

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

bioinformatics, biology, environmental, epigenomics, genetic, genomic, life sciences

The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.

Usage examples

Visualize TaRGET II data with WashU Epigenome Browser by WashU Epigenome Browser
Finding and Downloading TaRGET II Data files by TaRGET-DCC
The NIEHS TaRGET II Consortium and environmental epigenomics by Wang, T., Pehrsson, E., Purushotham, D. et al.
The role of environmental exposures and the epigenome in health and disease. by Perera BPU, Faulk C, Svoboda LK, Goodrich JM, Dolinoy DC.
Comparison of differential accessibility analysis strategies for ATAC-seq data by Gontarz P, Fu S, Xing X, Liu S, Miao B et al.

See 8 usage examples →

U.S. Environmental Protection Agency (EPA) Center for Computational Toxicology and Exposure High Throughput Transcriptomics Data

bioinformatics, fastq, gene expression, transcriptomics

High-throughput transcriptomics (HTTr) data generated by US EPA Office of Research and Development, Center for Computational Toxicology and Exposure (CCTE), Biomolecular and Computational Toxicology Division. All data is generated using TempO-Seq targeted RNA-seq technology from in vitro cell culture systems.

Usage examples

High-Throughput Transcriptomics Platform for Screening Environmental Chemicals by Harrill J., Everett L., Haggard D., Sheffield T., Bundy J., Willis C., et al
High-Throughput Transcriptomics of Water Extracts Detects Reductions in Biological Activity with Water Treatment Processes by Rogers J., Leusch F., Chambers B., Daniels K., Everett L., Judson R., et al
Benchmark Dose Modeling Approaches for Volatile Organic Chemicals Using a Novel Air-Liquid Interface In Vitro Exposure System by Speen A., Murray J., Krantz Q., Davies D., Evansky P., Harrill J., et al
Exploring the Effects of Experimental Parameters and Data Modeling Approaches on In Vitro Transcriptomic Point-of-Departure Estimates by Harrill J., Everett L., Haggard D., Bundy J., Willis C., Shah I., et al
Combining phenotypic profiling and targeted RNA-Seq reveals linkages between transcriptional perturbations and chemical effects on cell morphology: Retinoic acid as an example by Nyffeler J., Willis C., Harris F., Taylor L., Judson R., Everett L., et al

See 8 usage examples →

SnpEff & SnpSift Genomic Variant Annotation Databases

bioinformatics, cancer, genetic, genome, genomic, life sciences, protein, structural variation, transcriptomics, variant annotation, vcf, whole exome sequencing, whole genome sequencing

SnpEff is a variant annotation and effect prediction tool that annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes). It supports over 38,000 genomes and provides comprehensive genomic databases for variant annotation. The databases include reference genomes, gene annotations, protein sequences, and regulatory elements from trusted sources like ENSEMBL, RefSeq, and UCSC. SnpSift complements SnpEff by providing tools to annotate genomic variants using databases, filter large genomic datasets, and manipulate annotated variants. Together, these ...

Usage examples

See 7 usage examples →

ESM Atlas — Protein Features and Structures

Biohub, bioinformatics, life sciences, machine learning, metagenomics, protein, structural biology

The ESM Atlas is a large-scale public dataset of computational outputs generated by ESMC and ESMFold2, derived from a deduplicated set of over 6.8 billion publicly available protein sequences spanning all domains of life, including viral proteins and previously unannotated sequences representing metagenomic dark matter sampled from a wide range of biomes. The dataset includes two primary components: sparse autoencoder (SAE) features for ~6.8 billion proteins, capturing interpretable biological representations from the ESMC 6B model, and predicted three-dimensional protein structures for ~1....

Usage examples

See 6 usage examples →

Open Bioinformatics Reference Data for Galaxy

bioinformatics, biology, genetic, genomic, life sciences, reference index

This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easily incorporable with a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...

Usage examples

Wrangling Galaxy's reference data by Daniel Blankenberg, James E. Johnson, The Galaxy Team, James Taylor, Anton Nekrutenko
TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages by Tiago C. Silva, Antonio Colaprico, Catharina Olsen, Fulvio D'Angelo, Gianluca Bontempi, Michele Ceccarelli, Houtan Noushmehr
Galaxy by Galaxy Project
Using Open Bio Ref Data with Galaxy and Bioconductor by Enis Afgan, Alexandru Mahmoud, Nuwan Goonasekera
Bioconductor by Bioconductor Project

See 6 usage examples →

Caenorhabditis Natural Diversity Resource

bam, bioinformatics, biology, Caenorhabditis elegans, fastq, gatk-sv, genetic maps, genome, genome wide association study, genomic, life sciences, short read sequencing, variant annotation, vcf

The Caenorhabditis Natural Diversity Resource (CaeNDR) is a data repository and analysis hub for wild strains of the selfing Caenorhabditis species C. elegans, C. briggsae, and C. tropicalis from around the world. It facilitates the discovery of genetic variation across all three species through genome-wide association mappings, which correlate genotype with phenotype and identify genetic variation underlying quantitative traits.

Usage examples

CaeNDR, the Caenorhabditis Natural Diversity Resource by Crombie TA, McKeown R, Moya ND, Evans KS, Widmayer SJ, LaGrassa V, et al.
Data Releases - C. briggsae by Erik Andersen
Data Releases - C. tropicalis by Erik Andersen
FAQ - AWS API by Erik Andersen
Data Releases - C. elegans by Erik Andersen

See 5 usage examples →

Protein Data Bank 3D Structural Biology Data

amino acid, archives, bioinformatics, biomolecular modeling, cell biology, chemical biology, COVID-19, electron microscopy, electron tomography, enzyme, life sciences, molecule, nuclear magnetic resonance, pharmaceutical, protein, protein template, SARS-CoV-2, structural biology, x-ray crystallography

The "Protein Data Bank (PDB) archive" was established in 1971 as the first open-access digital data archive in biology. It is a collection of three-dimensional (3D) atomic-level structures of biological macromolecules (i.e., proteins, DNA, and RNA) and their complexes with one another and various small-molecule ligands (e.g., US FDA approved drugs, enzyme co-factors). For each PDB entry (unique identifier: 1abc or PDB_0000001abc) multiple data files contain information about the 3D atomic coordinates, sequences of biological macromolecules, information about any small molecules/ligan...

Usage examples

PDB 101 by RCSB PDB
Protein Data Bank: the single global archive for 3D macromolecular structure data by wwPDB consortium
Announcing the worldwide Protein Data Bank by Berman, H., Henrick, K. & Nakamura, H.
File Download Services by RCSB PDB
Get to Know a Dataset: Protein Data Bank 3D Structural Biology Data by RCSB PDB

See 5 usage examples →

SPARC: Datasets bridging the body and the brain

bioinformatics, electrophysiology, life sciences, microscopy, neurophysiology, neuroscience

The SPARC Datasets comprise a collection of scientific data focused on bridging the body and the brain. The datasets focus on neural connectivity, organ innervation, and detailed anatomical mapping of the peripheral nervous system. SPARC datasets distinguish themselves from other data resources through their multi-modal approach to scientific data, integrating molecular, imaging, timeseries, and other datatypes associated with the interaction between the peripheral nervous system and organs. SPARC data provides a unique integrated effort to develop next generation mapping of anatomical ...

Usage examples

Downloading large scale SPARC datasets by The SPARC Data and Resource Center
The Pennsieve Data Management Platform by Joost Wagenaar
The SPARC Portal by Peter Hunter, Maryann Martone, Esra Neufeld, Joost Wagenaar
OSPARC by Esra Neufeld
Download public data, scaffolds and run computations by The SPARC Data and Resource Center

See 8 usage examples →

BUSCO Datasets

assembly, bacteria, bioinformatics, genomic, life sciences, metagenomics, open source software, protein, virus

Lineage datasets for use with the BUSCO software package. Each dataset contains HMM profiles for clade-specific, universal, single-copy marker genes. Datasets are available across the archaea, bacteria, eukaryota, and virus domains. The repository also includes the data files needed for phylogenetic placement of an input assembly.

Usage examples

BUSCO - assessing genomic data quality and beyond. by Mosè Manni, Matthew R. Berkeley, Mathieu Seppey, Evgeny M. Zdobnov
BUSCO - from QC to gene prediction and phylogenomics by Matthew Berkeley
OrthoDB and BUSCO update - annotation of orthologs with wider sampling of genomes. by Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva
BUSCO Update - Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. by Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, Evgeny M Zdobnov

See 4 usage examples →

Basic Local Alignment Search Tool (BLAST) Databases

bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, STRIDES, transcriptomics

A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).

Usage examples

BLAST+ Docker by NCBI BLAST
BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden
BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang
Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman

See 4 usage examples →

Encyclopedia of DNA Elements (ENCODE)

bioinformatics, biology, genetic, genomic, life sciences

The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a ...

Usage examples

See 4 usage examples →

Epigenomes of the Human Pangenome Reference Consortium (HPRC) Release 2

bioinformatics, biology, epigenomics, genetic, genomic, life sciences

The Human Pangenome Reference Consortium (HPRC) Release 2 represents a landmark achievement in genomics, providing high-quality phased genome assemblies from over 200 individuals with comprehensive functional genomics data. The HPRC Epigenome Browser provides researchers a way to explore all epigenomics data generated by release 2. The HPRC Epigenome Browser (HPRCEB) is a modern, interactive web portal that democratizes access to HPRC Release 2 epigenomics data through an intuitive interface supporting genome selection, data visualization, and bulk download capabilities. The portal integrates ...

Usage examples

"Get To Know A Dataset: HPRC Epigenome" by HPRC Epigenome Browser
WashU Epigenome Browser update 2025 by Chanrung Seng, Shane Liu, Wenjin Zhang, Xiaoyu Zhuo, Daofeng Li, Ting Wang
A draft human pangenome reference by Liao, WW., Asri, M., Ebler, J. et al.
"Modbed track: Visualization of modified bases in single-molecule sequencing" by Daofeng Li, Xiaoyu Zhuo, Jessica K. Harrison, Shane Liu, Ting Wang

See 4 usage examples →

Epilepsy.Science

bioinformatics, electrophysiology, life sciences, medicine, neuroscience

Epilepsy.Science comprises a set of datasets focused on epilepsy research, spanning both clinical and pre-clinical data. Datasets are contributed by the epilepsy research community and published using a standardized structure and metadata. Clinical datasets include de-identified subject information, EEG, and clinical imaging.

Usage examples

The Pennsieve Data Management Platform by Joost Wagenaar
The Epilepsy.Science Portal by Joost Wagenaar, Brandon Westover, Kathryn Davis, Nishant Sinha, Brian Litt
Submitting a dataset proposal by Pennsieve
Pennsieve Open Repositories by Pennsieve

See 4 usage examples →

Refgenie reference genome assets

bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing

Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.

Usage examples

See 4 usage examples →

Synthea synthetic patient generator data in OMOP Common Data Model

bioinformatics, health, life sciences, natural language processing, us

The Synthea-generated data is provided here as 1,000-person (1k), 100,000-person (100k), and 2,800,000-person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and gov...

Usage examples

Create data science environments on AWS for health analysis using OHDSI by James Wiggins
Predict patient health outcomes using OHDSI and machine learning on AWS by James Wiggins
Map clinical notes to the OMOP Common Data Model and healthcare ontologies using Amazon Comprehend Medical by James Wiggins
OHDSIonAWS by James Wiggins

See 4 usage examples →

The Impact of Variation on Function Consortium (IGVF)

bioinformatics, biology, genetic, genomic, life sciences

The IGVF (Impact of Genomic Variation on Function) Consortium aims to understand how genomic variation affects genome function, which in turn impacts phenotype. The NHGRI is funding this collaborative program that brings together teams of investigators who will use state-of-the-art experimental and computational approaches to model, predict, characterize and map genome function, how genome function shapes phenotype, and how these processes are affected by genomic variation. These joint efforts will produce a catalog of the impact of genomic variants on genome function and phenotypes.
The Da...

Usage examples

See 4 usage examples →

AdaptiveFlow Ligand Libraries

bioinformatics, life sciences, medicine, pharmaceutical, structural biology

AdaptiveFlow Versions of Ligand Libraries in Ready-To-Dock Format

Usage examples

See 3 usage examples →

DeepDrug Protein Embeddings Bank (DPEB)

bioinformatics, life sciences, machine learning, protein, structural biology

DPEB is a multimodal database of human protein embeddings integrating four biologically complementary representations—AlphaFold2, BioEmbeddings, ESM-2, and ProtVec—designed for enhanced protein-protein interaction prediction and functional classification.

Usage examples

See 3 usage examples →

I-CARE: International Cardiac Arrest REsearch consortium Electroencephalography Database

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalography (EEG) recordings from 1,020 comatose patients with a diagnosis of cardiac arrest who were admitted to an intensive care unit from seven academic hospitals in the U.S. and Europe. Patients were monitored with 18 bipolar EEG channels over hours to days for the diagnosis of seizures and for neurological prognostication. Long-term neurological function was determined using the Cerebral Performance Category scale.

Usage examples

The International Cardiac Arrest Research (I-CARE) Consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.
WFDB Software Package by Moody, G., Pollard, T., & Moody, B.
I-CARE: International Cardiac Arrest REsearch consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.

See 3 usage examples →

Imaging MIT Licensed data and models

biodiversity, Biohub, bioinformatics, biology, biomolecular modeling, brain images, cell biology, cell imaging, imaging, life sciences, machine learning, microscopy, model, protein, zarr

This dataset contains a diverse range of imaging biological data and models. The data is sourced and curated by a team of experts at Biohub and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

CELL-Diff: Unified diffusion modeling for protein sequences and microscopy images by Zheng Dihan, Bo Huang
Quickstart Tutorial for CELL-Diff by Biohub
Documentation for CELL-Diff by Biohub
SubCell: Vision foundation models for microscopy capture single-cell biology by Ankit Gupta, Zoe Wefers, Konstantin Kahnert, Jan N Hansen, William D. Leineweber, Anthony Cesnik, Dan Lu, Ulrika Axelsson, Frederic Ballllosera Navarro, Theofanis Karaletsos, Emma Lundberg
Quickstart Tutorial for SubCell by Biohub

See 6 usage examples →

Kraken2 NCBI RefSeq Complete V205 database on AWS

benchmark, bioinformatics, life sciences, metagenomics, microbiome

Database for use with Kraken2 (taxonomic annotation of metagenomic sequencing reads) including all NCBI RefSeq genomes available in release V205

Usage examples

Kraken2 by Derrick Wood, Jennifer Lu and Ben Langmead
Using an Amazon Machine Image for analysing samples with Kraken2 by Robyn Wright
From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools by Robyn J. Wright, Andre M. Comeau and Morgan G.I. Langille

See 3 usage examples →

MIMIC-III (‘Medical Information Mart for Intensive Care’)

bioinformatics, health, life sciences, natural language processing, us

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework. The MIMIC-I...

Usage examples

MIMIC-code GitHub repository by Alistair Johnson
Perform biomedical informatics without a database using MIMIC-III data and Amazon Athena by James Wiggins, Alistair Johnson
Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data by Ujjwal Ratan, Nihir Chadderwala, and Parminder Bhatia

See 3 usage examples →

NASA Space Biology Open Science Data Repository (OSDR)

bioinformatics, biology, GeneLab, genomic, imaging, life sciences, space biology

NASA’s Space Biology Open Science Data Repository (OSDR) introduces a one-stop site where users can explore and contribute a variety of NASA open science biological data. This site consolidates data from the Ames Life Sciences Data Archive (ALSDA) and GeneLab and includes information about the broader NASA Open Science and Open Data initiatives, all at one centralized location. Our mission is to maximize the utilization of the valuable biological research resources and enable new discoveries.

OSDR introduces access to data generated from spaceflight and space relevant experiments that explore ...

Usage examples

GeneLab: Omics database for spaceflight experiments by Shayoni Ray, Samrawit Gebre, Homer Fogle, Daniel C Berrios, Peter B Tran, Jonathan M Galazka, Sylvain V Costes
NASA GeneLab: interfaces for the exploration of space omics data by Daniel C Berrios, Jonathan Galazka, Kirill Grigorev, Samrawit Gebre, Sylvain V Costes
Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration by Ryan T. Scott, Kirill Grigorev, Graham Mackintosh, Samrawit G. Gebre, Christopher E. Mason, Martha E. Del Alto, Sylvain V. Costes

See 3 usage examples →

ONT Methylation Benchmarking Datasets

bam, benchmark, bioinformatics, epigenomics, genomic, life sciences, long read sequencing

ONT Methylation Benchmarking Datasets are generated to benchmark existing methylation-calling tools on the Oxford Nanopore sequencing platform using their recent R10.4.1 flowcell chemistry. It spans a diverse range of species, including bacteria (E. coli, H. pylori J99, H. pylori 26695, A. variabilis, T. denticola), plants (Rice, Arabidopsis), and mammals (mouse, human). In addition, the dataset includes EMSeq data for E. coli, plant, and mouse samples, which can serve as ground truth for methylation studies. It also provides unmethylated whole-genome amplified (WGA) DNA for H. pylori 26695 and...

Usage examples

Running Benchmarking Pipeline (Nextflow/Snakemake) on an Example Dataset using AWS by Onkar Kulkarni
Methylation calling using ONT methylation benchmarking dataset by Onkar Kulkarni
Comprehensive benchmarking of tools for nanopore-based detection of DNA methylation by Kulkarni et al.

See 3 usage examples →

Open Human Genome Library

bioinformatics, biology, genomic, life sciences

The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.

Usage examples

AGC: compact representation of assembled genomes with fast queries and updates by Sebastian Deorowicz, Agnieszka Danek, Heng Li
BWT construction and search at the terabase scale by Heng Li
Using OpenHGL data by Heng Li

See 3 usage examples →

ProteinGym

bioinformatics, biology, deep learning, life sciences, machine learning, protein

ProteinGym is a benchmark suite for assessing the performance of protein fitness prediction and design models. It comprises a large curated collection of 200+ high-throughput experimental assays (~3M mutated sequences), as well as clinical annotations from experts about the pathogenicity of mutants in over 3k human genes.

Usage examples

ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design by Pascal Notin, et al.
Scoring ProteinGym assays with TranceptEVE by Daniel Ritter
ProteinGym website by Pascal Notin & Daniel Ritter

See 3 usage examples →

QIIME 2 Tutorial Data

bioinformatics, biology, ecosystems, environmental, genetic, genomic, health, life sciences, metagenomics, microbiome

QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

Usage examples

See 3 usage examples →

SPaRCNet data: Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The IIIC dataset includes 50,697 labeled EEG samples from 2,711 patients and 6,095 EEGs that were annotated by physician experts from 18 institutions. These samples were used to train SPaRCNet (Seizures, Periodic and Rhythmic Continuum patterns Deep Neural Network), a computer program that classifies IIIC events with an accuracy matching clinical experts.

Usage examples

Development of Expert-Level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation by Jing J, Ge W, Hong S, Fernandes MB, Lin Z, Yang C, et al.
SPaRCNet data: Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography by Jing, J., Ge, W., Struck, A. F., Fernandes, M., Hong, S., An, S., et al.
IIIC-SPaRCNet Github Repository by Brain Data Science Platform (BDSP)

See 3 usage examples →

SoilMicrobeDB genome database on AWS

bioinformatics, life sciences, metagenomics, microbiome

SoilMicrobeDB is a Kraken2 genome database with extensive representation of high-quality genomes of soil organisms, including uncultured and fungal species.

Usage examples

See 3 usage examples →

run_dbcan CAZyme and CGC annotation database on AWS

benchmark, bioinformatics, life sciences, metagenomics, microbiome

Database for use with run_dbcan (CAZyme and CGC annotation), including CAZyme, Transporter, Transcription factor, Signaling Transduction Protein, Sulfatase, Peptidase, and Polysaccharide utilization Loci.

Usage examples

dbCAN3: automated carbohydrate-active enzyme and substrate annotation by Jinfang Zheng, Qiwei Ge, Yuchen Yan, Xinpeng Zhang, Le Huang, Yanbin Yin
run_dbcan Documentation by Xinpeng Zhang; Haidong Yi; Yanbin Yin
run_dbcan by Xinpeng Zhang; Haidong Yi; Jinfang Zheng; Le Huang; Qiwei Ge; Yanbin Yin

See 3 usage examples →

4D Nucleome (4DN)

bioinformatics, biology, genetic, genomic, imaging, life sciences

The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...

Usage examples

See 2 usage examples →

Biodiversity Heritage Library Metadata and Page Images

biodiversity, bioinformatics, life sciences

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access.

Usage examples

See 5 usage examples →

Broad Genome References

bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index

Broad-maintained human genome reference builds hg19/hg38 and decoy references.

Usage examples

Advancing NGS quality control to enable measurement of actionable mutations in circulating tumor DNA by Willey J. C., Morrison T. B., Austermiller B., Crawford E. E., et al (2021)
Using Amazon FSx for Lustre for Genomics Workflows on AWS by W. Lee Pang

See 2 usage examples →

COVID-19 Data Lake

amazon.science, bioinformatics, biology, coronavirus, COVID-19, health, life sciences, medicine, MERS, SARS

A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and rela...

Usage examples

See 5 usage examples →

Cloud Indexes for Bowtie, Kraken, HISAT, and Centrifuge

bioinformatics, biology, genomic, life sciences, mapping, medicine, reference index, whole genome sequencing

Genomic tools use reference databases as indexes to operate quickly and efficiently, analogous to how web search engines use indexes for fast querying. Here, we aggregate genomic, pan-genomic and metagenomic indexes for analysis of sequencing data.
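To make the search-engine analogy concrete, here is a minimal Python sketch of an index: a hash map from every k-mer of a reference sequence to the positions where it occurs. The function names, the toy reference string, and the choice of k are all hypothetical and used only for illustration; the tools covered by this dataset rely on far more compact, tool-specific index structures, so this is not how any of them is implemented.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def lookup(index, read, k):
    """Return candidate reference positions for the first k-mer of a read."""
    # .get() avoids inserting empty entries into the defaultdict on a miss.
    return index.get(read[:k], [])


# Toy reference sequence (hypothetical, not taken from any dataset).
reference = "ACGTACGTGACC"
index = build_kmer_index(reference, k=4)

# Looking up a read costs one hash lookup instead of a scan of the reference.
hits = lookup(index, "ACGTGA", k=4)
print(hits)
```

The index is built once and reused for every read, which is why pre-built indexes are worth distributing as a dataset.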

Usage examples

Table of contents for tutorials for constituent tools by Ben Langmead
Reducing reference bias using multiple population reference genomes by Chen et al (2020)

See 2 usage examples →

DNAStack COVID19 SRA Data

bam, bioinformatics, coronavirus, COVID-19, fasta, fastq, genetic, genomic, global, health, life sciences, long read sequencing, SARS-CoV-2, vcf, virus, whole genome sequencing

The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...

Usage examples

Viral AI by DNAstack
Viral lineage assignment by Heather Ward

See 2 usage examples →

E11bio PRISM

bioinformatics, biology, brain images, cell imaging, computer vision, fluorescence imaging, high-throughput imaging, image processing, imaging, ion channels, life sciences, machine learning, microscopy, morphological reconstructions, Mus musculus, neurobiology, neuroimaging, neuroscience, protein, segmentation, zarr

This dataset was generated using E11.bio's PRISM technology (Protein Reconstruction and Identification through Multiplexing), a platform that combines viral barcoding, expansion microscopy, and iterative immunolabeling for large-scale neuronal reconstruction. Neurons in the mouse hippocampal CA3 were transduced with a library of adeno-associated viruses (AAVs) encoding diverse “protein bits”: small epitope tags that act as combinatorial barcodes. Tissue was then processed with an expansion microscopy protocol, physically enlarging the sample ~5× to achieve an effective voxel size of ~35 × 3...

Usage examples

See 2 usage examples →

EMBER Open Datasets

activity detection, activity recognition, analytics, bioinformatics, brain images, brain models, cloud computing, computer vision, deep learning, electrophysiology, GPS, h5, hdf5, Homo sapiens, json, life sciences, localization, machine learning, magnetic resonance imaging, Mus musculus, neurobiology, neuroimaging, neurophysiology, neuroscience, non-human primate, signal processing, speech processing, zarr

This is data from the Ecosystem for Multi-modal Brain-behavior Experimentation and Research (EMBER). It contains time series behavioral and neuroscience data from animal and de-identified human subjects across multiple modalities.

Usage examples

Mapping the landscape of social behavior by Ugne Klibaite, Tianqing Li, Diego Aldarondo, Jumana F Akoad, Bence P Ölveczky, Timothy W Dunn.
Get To Know A Dataset - EMBER by EMBER Team

See 2 usage examples →

Emory Knee Radiograph (MRKR) dataset

bioinformatics, biology, computer vision, csv, health, imaging, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray

The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of whom are African American. This dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type, and presence of hardware, enhancing its value for research and model development. MRKR addresses significant gaps ...

Usage examples

Example Notebook by Emory-HITI
Emory Knee Radiograph Dataset by Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, Hari Trivedi.

See 2 usage examples →

GATK Structural Variation (SV) Data

bioinformatics, biology, cromwell, gatk-sv, genetic, genomic, life sciences, structural variation

This dataset holds the data needed to run a structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data in AWS.

Usage examples

AWS Setup & Execution by Goldfinch Bio and Loka Inc.
Structural Variant Analysis on AWS with Amazon FSx for Lustre by Goldfinch Bio and Loka Inc.

See 2 usage examples →

Harvard Electroencephalography Database

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The Harvard EEG Database will encompass data gathered from four hospitals affiliated with Harvard University: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH).

Usage examples

Harvard-EEG-Database-Tools by Brain Data Science Platform (BDSP)
Harvard Electroencephalography Database by Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., et al.

See 2 usage examples →

Harvard-Emory ECG Database

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.

Usage examples

WFDB Software Package by Moody, G., Pollard, T., & Moody, B.
Harvard Electroencephalography Database by Moura Junior, V.; Reyna, M.; Hong, S.; Gupta, A.; Ghanta, M.; Sameni, R., et al.

See 2 usage examples →

Hecatomb Databases

bioinformatics, genetic, genomic, life sciences, metagenomics, virus, whole genome sequencing

Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.

Usage examples

See 2 usage examples →

Indexes for Kaiju

bioinformatics, biology, genomic, life sciences, metagenomics, microbiome, reference index, whole genome sequencing

This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.

Usage examples

Fast and sensitive taxonomic classification for metagenomics with Kaiju by Peter Menzel et al (2016)
Quickstart Tutorial for downloading the index files and running Kaiju. by Peter Menzel

See 2 usage examples →

RNA structure by fragmentation frequency

bioinformatics, genomic, life sciences, transcriptomics

The fragSTRUC project develops software to extract RNA secondary structure information from Illumina datasets, based on the fact that divalent ions used in standard RNA-seq library preparation fragment sequences at non-base-paired regions of RNA.

Usage examples

Accessing the fragSTRUC dataset on AWS by Yuk Kei Wan and Leonard Schärfen
fragSTRUC: RNA structure by fragmentation frequency by Yuk Kei Wan and Leonard Schärfen

See 2 usage examples →

Reference Indexes for krepp

bioinformatics, life sciences, metagenomics, microbiome, reference index

krepp is an alignment-free method for estimating distances and phylogenetic placement of individual reads to many thousands of reference genomes in a scalable manner using k-mers. This dataset includes k-mer-based indexes consisting of ultra-large reference genome sets that can be efficiently analyzed using krepp.
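As a rough illustration of what "alignment-free" distance estimation with k-mers can mean, the Python sketch below compares a read to reference sequences by the Jaccard distance between their k-mer sets. This is a generic textbook technique swapped in for illustration, not krepp's actual algorithm or index format; the function names, sequences, and k value are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k):
    """1 - |shared k-mers| / |all k-mers|; 0.0 means identical k-mer content."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


# Toy read and references (hypothetical, not from the dataset).
read = "ACGTAC"
references = {"ref_a": "ACGTACGG", "ref_b": "TTTTGGGG"}

# No alignment is computed: only shared k-mer content is compared.
distances = {
    name: jaccard_distance(read, seq, k=3) for name, seq in references.items()
}
closest = min(distances, key=distances.get)
print(closest, distances)
```

Comparing a read against thousands of genomes this way would require recomputing k-mer sets each time, which is the cost that a pre-built k-mer index is meant to avoid.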

Usage examples

See 2 usage examples →

Sid Sijbrandij's osteosarcoma dataset

bioinformatics, cancer, genomic, life sciences, spatial omics

Sid was diagnosed with osteosarcoma in November 2022 and, after running out of standard-of-care treatment options, he began pursuing maximum diagnostics, creating new treatments, doing treatments in parallel, and scaling this approach for others. This dataset includes the clinical and molecular data generated by and for Sid throughout this journey: longitudinal single cell and bulk RNA and DNA sequencing, spatial transcriptomics, residual disease testing (MRD), flow cytometry, imaging, and clinical lab results. This data is being freely shared with the community to advance cancer research.

Usage examples

Get To Know A Dataset: Sid Sijbrandij's osteosarcoma dataset by Marshall Thompson
Osteosarc.com by Sid's Osteosarcoma Team

See 2 usage examples →

Somatic Mosaicism across Human Tissues (SMaHT)

bam, bioinformatics, biology, genetic, genomic, imaging, life sciences, whole genome sequencing

The Somatic Mosaicism across Human Tissues (SMaHT) project is an NIH Common Fund consortium (2023-) that aims to comprehensively characterize somatic variation ("mosaicism") in normal human tissues. While most genetic studies have relied on blood-derived DNA, SMaHT captures the full spectrum of DNA variation across cell types, tissues, and organs from phenotypically normal individuals to better understand the role of somatic mosaicism in human development, aging, and disease progression. Researchers in the consortium develop and apply experimental and computational methods, paired with th...

Usage examples

Somatic Mosaicism across Human Tissues Data Portal by SMaHT Data Analysis Center (DAC)
The Somatic Mosaicism across Human Tissues Network by Coorens T, Oh J, Choi Y, Lim N, Zhao B, Voshall A et al.

See 2 usage examples →

UniProt

bioinformatics, biology, chemistry, enzyme, graph, life sciences, molecule, protein, RDF, SPARQL

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.

Usage examples

Exploring the UniProt protein knowledgebase with AWS Open Data and Amazon Neptune by Eric Greene, Rafa Xu, Yuan Shi (AWS)
UniProt SPARQL by Swiss-Prot Group at SIB Swiss Institute of Bioinformatics

See 2 usage examples →

CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

amazon.science, bioinformatics, health, life sciences, natural language processing, us

DE-SynPUF is provided here as 1,000-person (1k), 100,000-person (100k), and 2,300,000-person (2.3m) data sets in the OMOP Common Data Model format. The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information. The purposes of the DE-SynPUF are to:

allow data entrepreneurs to develop and create software and applications that may eventually be applied to actual CMS claims data;
train researchers on the use and complexity of conducting anal

...

Usage examples

OHDSIonAWS by James Wiggins
Map clinical notes to the OMOP Common Data Model and healthcare ontologies using Amazon Comprehend Medical by James Wiggins
Create data science environments on AWS for health analysis using OHDSI by James Wiggins
Predict patient health outcomes using OHDSI and machine learning on AWS by James Wiggins

See 4 usage examples →

COVID-19 Genome Sequence Dataset

bam, bioinformatics, biology, coronavirus, COVID-19, cram, fastq, genetic, genomic, health, life sciences, MERS, SARS, STRIDES, transcriptomics, virus, whole genome sequencing

This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six ho...
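For readers unfamiliar with the Variant Call Format, the Python sketch below parses one tab-separated VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record is made up for illustration and is not taken from this dataset; the function name is hypothetical, and a real workflow would more likely use an established VCF library or query the parquet files directly.

```python
# The eight fixed columns defined by the VCF specification.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse a single VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position on the reference
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    info = {}
    if record["INFO"] != ".":                 # "." means no INFO annotations
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True  # flag entries carry no value
    record["INFO"] = info
    return record


# Illustrative record only (assumed values, not real dataset output).
example = "NC_045512.2\t23403\t.\tA\tG\t228\tPASS\tDP=100;INDEL"
record = parse_vcf_line(example)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["INFO"])
```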

Usage examples

Download SRA sequence data using Amazon Web Services (AWS) by NCBI SRA

See 1 usage example →

Conformational Space of Short Peptides

amino acid, bioinformatics, biomolecular modeling, life sciences, molecular dynamics, protein, structural biology

Co-managed by Toyoko and the Structural Biology Group at the Universidad Nacional de Quilmes, this dataset allows us to explore the conformational space of all possible peptides using the 20 common amino acids. It consists of a collection of exhaustive molecular dynamics simulations of tripeptides and pentapeptides.

Usage examples

Intro to Conformational Space of Short Peptides by Sebastian Bassi and Virginia Gonzalez

See 1 usage example →

Global Biodiversity Information Facility (GBIF) Species Occurrences

biodiversity, bioinformatics, conservation, earth observation, life sciences

The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world's governments providing global data that document the occurrence of species. GBIF currently integrates datasets documenting over 1.6 billion species occurrences, growing daily. The GBIF occurrence dataset combines data from a wide array of sources including specimen-related data from natural history museums, observations from citizen science networks and environment recording schemes. While these data are constantly changing at GBIF.org, periodic snapshots are taken a...

Usage examples

GBIF and Apache-Spark on AWS tutorial by John Waller

See 1 usage example →

LongBench - cross-platform reference dataset profiling cancer cell lines with bulk and single-cell approaches

bam, benchmark, bioinformatics, cancer, fastq, life sciences, long read sequencing, short read sequencing, single-cell transcriptomics, vcf

LongBench is a comprehensive benchmark dataset of the latest long-read transcriptomics technologies from Oxford Nanopore (ON) and Pacific Biosciences, alongside a comparison with next-generation sequencing from Illumina. We generated bulk and single-cell libraries from lung cancer cell lines which include different cancer subtypes to capture real biological variation. To further compare and assess sequencing platform performance, Sequins and SIRVs (Set 4) synthetic spike-ins have been included.

Usage examples

Benchmarking long-read DE gene and transcript analysis with edgeR by Yupei You

See 1 usage example →

OceanOmics

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences, metagenomics

Minderoo Foundation OceanOmics aims to establish environmental DNA (eDNA) as a tool to measure, understand, and protect oceans. OceanOmics mainly generates two types of data: eDNA sequencing data (metabarcoding, metagenomics), and genome assembly data (marine vertebrates).

Usage examples

Case-studies on using OceanOmics genomes and eDNA data by Philipp Bayer

See 1 usage example →

Oxford Nanopore Technologies Benchmark Datasets

bioinformatics, biology, fast5, fastq, genomic, Homo sapiens, life sciences, whole genome sequencing

The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; 3) development of tools and methods. The data deposited showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.

Usage examples

ONT Dataset Tutorials by EPI2MELabs

See 1 usage example →

Synthea Coherent Data Set

bioinformatics, csv, dicom, genomic, health, imaging, life sciences, medicine

This is a synthetic data set that includes FHIR resources, DICOM images, genomic data, physiological data (i.e., ECGs), and simple clinical notes. FHIR links all the data types together.

Usage examples

The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive Synthetic Health Record. by Walonoski J, Hall D, Bates KM, Farris MH, Dagher J, Downs ME, Sivek RT, Wellner B, Gregorowicz A, Hadley M, Campion FX, Levine L, Wacome K, Emmer G, Kemmer A, Malik M, Hughes J, Granger E, Russell S.

See 1 usage example →

recount3

bioinformatics, biology, cancer, csv, gene expression, genetic, genomic, Homo sapiens, life sciences, Mus musculus, neuroscience, transcriptomics

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.

Usage examples

recount3 quick start guide by Leonardo Collado-Torres

See 1 usage example →

BioLiP

bioinformatics, chemistry, life sciences, molecular docking, molecule, protein, structural biology

BioLiP is a semi-manually curated database for high-quality, biologically relevant ligand-protein binding interactions. The structure data are collected primarily from the Protein Data Bank (PDB), with biological insights mined from literature and other specific databases. BioLiP aims to construct the most comprehensive and accurate database for serving the needs of ligand-protein docking, virtual ligand screening and protein function annotation.

Usage examples

BioLiP API usage by Zhang Lab
BioLiP2: an updated structure database for biologically relevant ligand-protein interactions by Chengxin Zhang, Xi Zhang, Peter L Freddolino, and Yang Zhang
BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions by Jianyi Yang, Ambrish Roy, and Yang Zhang

See 3 usage examples →

Brain Data Science Database 1

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

This collection unifies multiple brain datasets spanning critical care, sleep medicine, cardiopulmonary health, infectious diseases, and other aspects of clinical neuroscience. It includes a variety of types of clinical neuroscience data including electroencephalography (EEG) and polysomnography (PSG) recordings, and supporting data to enable research in diverse areas of clinical neuroscience such as epilepsy, delirium, coma, and sleep medicine. All data is de-identified and includes code to reproduce results in accompanying research publications. The data is available for non-commercial resea...

Brain Data Science Database 2

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

This collection unifies multiple brain datasets spanning critical care, sleep medicine, cardiopulmonary health, infectious diseases, and other aspects of clinical neuroscience. It includes large-scale electroencephalography (EEG) and polysomnography (PSG) recordings, brain imaging data (MRI, CT, PET), and supporting data to enable research in diverse areas of clinical neuroscience such as epilepsy, delirium, coma, sleep depth, sleep-related breathing disorders, meditation, subarachnoid hemorrhage, cardiac arrest, neuroinfectious diseases, and audiology. All data is de-identified and includes a...

Brain Data Science Database 3

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

This collection unifies multiple brain datasets spanning critical care, sleep medicine, cardiopulmonary health, infectious diseases, and other aspects of clinical neuroscience. It includes large-scale electroencephalography (EEG) and polysomnography (PSG) recordings, brain imaging data (MRI, CT, PET), and electronic health records (EHR) data supporting research in areas such as epilepsy, delirium, coma, sleep depth, sleep-related breathing disorders, burst suppression, meditation, subarachnoid hemorrhage, cardiac arrest, and neuroinfectious diseases. All data is de-identified and includes algo...

COVID-19 Molecular Structure and Therapeutics Hub

bioinformatics, biology, coronavirus, COVID-19, life sciences, molecular docking, pharmaceutical

Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community. A community-driven data repository and curation service for molecular structures, models, therapeutics, and simulations supporting computational research on therapeutic opportunities for COVID-19 (caused by the SARS-CoV-2 coronavirus).

CartoStore

bioinformatics, genomic, life sciences, spatial omics, spatial transcriptomics

Cross-Platform Repository for High-resolution Spatial Transcriptomics Datasets.

Usage examples

Example CartoStore Repository for Xenium Breast Cancer Dataset by Hyun Min Kang and Weiqiu Cheng
Cartloader Documentation by Hyun Min Kang and Weiqiu Cheng
CartoStore Overview by Hyun Min Kang and Weiqiu Cheng

See 3 usage examples →

GATK Test Data

bioinformatics, biology, cancer, genetic, genomic, life sciences

The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute's Genome Analysis Toolkit (GATK).

GX database for NCBI Foreign Contamination Screen (FCS) Tool Suite

assembly, bioinformatics, biology, contamination, fasta, genetic, genome, health, life sciences, STRIDES, whole genome sequencing

Sequence database used by FCS-GX (Foreign Contamination Screen - Genome Cross-species aligner) to detect contamination from foreign organisms in genome sequences.

Genome Ark

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.

Imaging BSD licensed data and models

biodiversity, Biohub, bioinformatics, biology, biomolecular modeling, brain images, cell biology, cell imaging, imaging, life sciences, machine learning, microscopy, model, protein, zarr

This dataset contains a diverse range of biological imaging data and models. The data is sourced and curated by a team of experts at Biohub and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

Quickstart Tutorial for Cytoland by Biohub
Documentation for Cytoland by Biohub
Cytoland: robust virtual staining of landmark organelles by Liu, Hirata-Miyasaki, et al.

See 3 usage examples →

InRad COVID-19 X-Ray and CT Scans

bioinformatics, coronavirus, COVID-19, health, life sciences, medicine, SARS

This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV-2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury.

MetaGraph Sequence Indexes

analysis ready data, biodiversity, bioinformatics, biology, fasta, genome, genomic, graph, information retrieval, life sciences, medicine, metagenomics, microbiome, transcriptomics, whole exome sequencing, whole genome sequencing

The MetaGraph Sequence Indexes dataset comprises full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). All index files can be used with the MetaGraph framework for sequence search. Indexes can be jointly used for aggregated search in the cloud or can be individually downloaded...

Usage examples

Usage within AWS by Oleksandr Kulkov
CloudFormation stack with a Step Function for dataset queries via AWS Batch by Oleksandr Kulkov
A global metagenomic map of urban microbiomes and antimicrobial resistance by Danko D, Bezdan D, Afshin EE, Ahsanuddin S, Bhattacharya C, Butler DJ, Chng KE, Donnellan D, Hecht J, Jackson K, Kuchin K, Karasikov M, Lyons A, Mak L, Meleshko D, Mustafa H, et al.

See 3 usage examples →

Metagenomic reference libraries for Slacken

bioinformatics, biology, genomic, life sciences, metagenomics, microbiome

Metagenomic indexes for use with the Slacken taxonomic classification tool.

Usage examples

Classifying metagenomic samples on AWS ElasticMapReduce by Johan Nyström-Persson
Slacken by Johan Nyström-Persson, Nishad Bapatdhar
Precise and scalable metagenomic profiling with sample-tailored minimizer libraries by Johan Nyström-Persson, Nishad Bapatdhar and Samik Ghosh

See 3 usage examples →

SocialGene RefSeq Databases

amino acid, bioinformatics, chemical biology, genomic, graph, life sciences, metagenomics, microbiome, pharmaceutical, protein

Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.

Usage examples

See 3 usage examples →

UCSC Genome Browser Sequence and Annotations

bioinformatics, biology, genetic, genomic, life sciences

The UCSC Genome Browser is an online graphical genome viewer hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-separated format and all binary files in custom formats, sometimes referred to as 'gbdb' files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome...

University of British Columbia Sunflower Genome Dataset

agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing

This dataset captures sunflower genetic diversity originating from thousands of wild, cultivated, and landrace sunflower individuals distributed across North America. The data consists of raw sequences and associated botanical metadata, aligned sequences (to three different reference genomes), and sets of SNPs computed across several cohorts.

iNaturalist Licensed Observation Images

biodiversity, bioinformatics, conservation, earth observation, life sciences

iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together to review these images and identify the observed organisms to species. This collection represents the licensed images accompanying iNaturalist observations.

GenomeKit genomic data

bioinformatics, genome, genomic, Homo sapiens, life sciences, Mus musculus, non-human primate, open source software, Rattus norvegicus, variant annotation

GenomeKit is Deep Genomics’ Python library for fast and easy access to genomic resources such as sequence, data tracks, and annotations. The goal is to let machine learning researchers build data sets easily, and to be creative about how those data sets are designed. Out of the box, GenomeKit provides access to pre-built optimized genomic data files that are required for its operation.

Usage examples

See 2 usage examples →

Platinum Pedigree

bioinformatics, genomic, genotyping, Homo sapiens, life sciences, long read sequencing, whole genome sequencing

The Platinum Pedigree Consortium (PCC) is a collaborative project to create a comprehensive reference for human genetic variation using a four-generation, 28-member family (CEPH-1463). We employed five different short- and long-read sequencing technologies to generate phased assemblies and characterize both inherited and de novo variation, including in some of the most difficult-to-genotype genomic regions such as tandem repeats, centromeres, and the Y chromosome. This extensive "truth set" is publicly available and can be used to test and benchmark new algorithms and technologies to ...

Usage examples

See 2 usage examples →

AllTheBacteria

assembly, bacteria, bioinformatics, fasta, genomic, life sciences, microbial genomics, short read sequencing, whole genome sequencing

All bacterial isolate whole-genome sequencing data from INSDC, uniformly assembled, quality-controlled, annotated, and searchable.

Usage examples

AllTheBacteria - all bacterial genomes assembled, available and searchable by Hunt M, Lima L, Anderson D, Hawkey J, Shen W, Lees J, Iqbal I

See 1 usage example →

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

amazon.science, bioinformatics, fastq, genetic, genomic, life sciences, long read sequencing, short read sequencing, whole exome sequencing, whole genome sequencing

To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets on different sequencing instruments, sequencing modalities (Illumina short read and Pacific Biosciences long read), sample preparation protocols, and for whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.

Usage examples

An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development by Baid G., Nattestad M., Kolesnikov A., Goel S., Yang H., Chang P., and Carroll A (2020)

See 1 usage example →