Registry of Open Data on AWS

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with life sciences.

You are currently viewing a subset of data tagged with life sciences.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

The Human Sleep Project

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from ~15K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting. This data is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG...

Usage examples

Automated Scoring of Respiratory Events in Sleep with a Single Effort Belt and Deep Neural Networks. IEEE Transactions on Biomedical Engineering. 2021 Dec 20;PP. doi: 10.1109/TBME.2021.3136753. Epub ahead of print. PMCID: PMC9119908. by Nassi TE, Ganglberger W, Sun H, Bucklin AA, Biswal S, van Putten MJAM, et al.
Sleep EEG-based Brain Age Index is a Biomarker for Dementia. JAMA Network Open. 2020 Sep 1;3(9):e2017357. doi: 10.1001/jamanetworkopen.2020.17357. PMID: 32986106; PMCID: PMC7522697. by Ye E, Sun H, Leone MJ, Paixao L, Thomas RJ, Lam AD, et al.
The Impact of Body Posture and Sleep Stages on Sleep Apnea Severity in Adults. Journal of Clinical Sleep Medicine. 2012 Dec 15;8(6):655-66A. PMCID: PMC3501662. by Eiseman NA, Westover MB, Ellenbogen JM, Bianchi MT
A cost-effectiveness analysis of nasal surgery to increase CPAP compliance in sleep apnea patients with nasal obstruction. The Laryngoscope. 2017 Apr;127(4):977-983. PMCID: PMC5483184. by Kempfle J, Westover MB, Bianchi MT.
Emergence of stable functional networks in long-term human EEG. The Journal of Neuroscience, Feb 2012; 32(8): 2703-2713. PMCID: PMC3361717. by Chu-Shore C, Kramer MA, Pathmanathan J, Bianchi MT, Westover MB, Wizon L, et al.

See 37 usage examples →

The Cancer Genome Atlas

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Usage examples

Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas by Yang Liu, Nilay S. Sethi, et al.
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context by Hua-Sheng Chiu, Sonal Somvanshi, et al.
Spatial Organization And Molecular Correlation Of Tumor-Infiltrating Lymphocytes Using Deep Learning On Pathology Images by Joel Saltz, Rajarsi Gupta, et al.
Genomic and Functional Approaches to Understanding Cancer Aneuploidy by Alison M. Taylor, Juliann Shih, et al.
ISB Cancer Genomics Cloud by Institute for Systems Biology

See 29 usage examples →

Foldingathome COVID-19 Datasets

alchemical free energy calculations, biomolecular modeling, coronavirus, COVID-19, foldingathome, health, life sciences, molecular dynamics, protein, SARS-CoV-2, simulations, structural biology

Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. Run by the Folding@home Consortium, a worldwide network of research laboratories focusing on a variety of different diseases, Folding@home seeks to address problems in human health on a scale that is infeasible by any other means, sharing the results of these large-scale studies with the research community through peer-reviewed publications and publicly shared datasets. During the COVID-19 epidemic, Folding@h...

Usage examples

Folding@home COVID-19 efforts by Folding@home Consortium
SARS-CoV-2 Simulations Go Exascale to Capture Spike Opening and Reveal Cryptic Pockets Across the Proteome by Maxwell I. Zimmerman, Justin R. Porter, Michael D. Ward, Sukrit Singh, Neha Vithani, Artur Meller, Upasana L. Mallimadugula, Catherine E. Kuhn, Jonathan H. Borowsky, Rafal P. Wiewiora, Matthew F. D. Hurley, Aoife M Harbison, Carl A Fogarty, Joseph E. Coffland, Elisa Fadda, Vincent A. Voelz, John D. Chodera, Gregory R. Bowman
SARS-CoV-2 spike RBD with N501Y mutation bound to human ACE2 (953.7 µs) by The Chodera lab at the Memorial Sloan Kettering Cancer Center
SARS-CoV-2 RNA polymerase (nsp12, RdRP) dataset: A 3.4 ms dataset of the SARS-CoV-2 nsp12 protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis
SARS-CoV-2 spike protein dataset: A 1.2 ms dataset of the SARS-CoV-2 spike protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis

See 24 usage examples →

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...

Usage examples

MicroRNA Expression-Based Model Indicates Event-Free Survival in Pediatric Acute Myeloid Leukemia by Lim EL, Trinh DL, Ries RE, et al.
CSF3R mutations have a high degree of overlap with CEBPA mutations in pediatric AML by Maxson JE, Ries RE, Wang YC, et al.
Ancestry and pharmacogenomics of relapse in acute lymphoblastic leukemia by Yang JJ, Cheng C, Devidas M, et al.
TCF21 hypermethylation in genetically quiescent clear cell sarcoma of the kidney by Gooskens SL, Gadd S, Guidry Auvil JM, et al.
Biomarker significance of plasma and tumor miR-21, miR-221, and miR-106a in osteosarcoma by Nakka M, Allen-Rhoades W, Li Y, et al.

See 24 usage examples →

Allen Cell Imaging Collections

biology, cell biology, cell imaging, Homo sapiens, image processing, life sciences, machine learning, microscopy

This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science. The types of data included in this bucket are listed below:

Field of view or cropped images of cells
Segmentations of structures in the images (e.g., boundaries of cells, DNA, other intracellular structures, etc.)
Processed versions of the above images and segmentations
Machine learning predictions and labels of the data listed above
Models trained on the previously listed data
Additional supporting non-image data related to the above listed data types (e.g., gene expression data, whole genome sequenc

...

Usage examples

See 20 usage examples →

CELLxGENE Discover Census

Biohub, bioinformatics, cell biology, life sciences, single-cell transcriptomics, transcriptomics

CELLxGENE Discover (cellxgene.cziscience.com) is a free-to-use platform for the exploration, analysis, and retrieval of single-cell data. CELLxGENE Discover hosts the largest aggregation of standardized single-cell data from the major human and mouse tissues, with modalities that include gene expression, chromatin accessibility, DNA methylation, and spatial transcriptomics. This year, CELLxGENE Discover has made available all of its human and mouse RNA single-cell data through Census (https://chanzuckerberg.github.io/cellxgene-census/) – a free-to-use service with an API and data that allows f...

Usage examples

See 19 usage examples →

Gabriella Miller Kids First Pediatric Research Program (Kids First)

cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing

The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...

Usage examples

Decreased ACKR3 (CXCR7) function causes oculomotor synkinesis in mice and humans. by Mary C Whitman, Noriko Miyake, et al.
CAVATICA by Seven Bridges Genomics
Elucidation of de novo small insertion/deletion biology with parent-of-origin phasing. by Allison H Seiden, Felix Richter, et al.
Clinically Relevant and Minimally Invasive Tumor Surveillance of Pediatric Diffuse Midline Gliomas Using Patient-Derived Liquid Biopsy by Eshini Panditharatna, Lindsay B Kilburn, et al.
Kids First DRC Portal by Kids First DRC

See 19 usage examples →

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, 4.2, and 4.4

bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing

Overview

This dataset contains alignment files and small variant (includes single nucleotide variants (SNV) and indels), copy number variant (CNV), short tandem repeat (i.e., repeat expansion; STR), structural variant (SV) and other variant call files from the 1000 Genomes Project (1KGP) Phase 3 dataset (3,202 individuals, 602 trios) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, v4.2.7, and v4.4.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more infor

...

Usage examples

PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions by Olson et al (2020)
DRAGEN Support Resources by Illumina Inc.
Illumina Connected Analytics User Guide by Illumina Inc.
DRAGEN Iterative gVCF Genotyper by Illumina Inc.
Data solution empowering population genomics by Illumina Inc. (2021)

See 17 usage examples →

Cell Painting Gallery

bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle

The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained ...

Usage examples

Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling by Wawer MJ, Li K, Gustafsdottir SM, Ljosa V, Bodycombe NE, Marton MA, Sokolnicki KL, Bray M-A, Kemp MM, Winchester E, Taylor B, Grant GB, Hon CSK, Duvall JR, Wilson JA, Bittker JA, Dancik V, Narayan R, Subramanian A, Winckler W, Golub TR, Carpenter AE, Shamji AF, Schreiber SL, & Clemons PA
Multiplex Cytological Profiling Assay to Measure Diverse Cellular States by Gustafsdottir SM, Ljosa V, Sokolnicki KL, Wilson JA, Walpita D, Kemp MM, Seiler KP, Carrel HA, Golub TR, Schreiber SL, Clemons PA, Carpenter AE, and Shamji AF
Center for Open Bioimage Analysis (COBA) YouTube Channel - video tutorials of CellProfiler and other softwares by Multiple Authors
Accelerating Drug Discovery with high-throughput Cell Painting on AWS by Chris Kaspar
Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes by Bray M-A, Singh S, Han H, Davis CT, Borgeson B, Hartland C, Kost-Alimova M, Gustafsdottir SM, Gibson CC, & Carpenter AE

See 17 usage examples →

Genome Aggregation Database (gnomAD)

bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v4.1 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries, sequenced as part of various disease-specific and population genetic studies. The gnomAD Principal Investigators and team can be found...

Usage examples

Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016) by Lek, M., Karczewski, K., Minikel, E. et al.
Transcript expression-aware annotation improves rare variant interpretation. Nature 581, 452–458 (2020) by Cummings, B. B., Karczewski, K. J., Kosmicki, J. A., Seaby, E. G., Watts, N. A., Singer-Berk, M., Mudge, J. M., Karjalainen, J., Kyle Satterstrom, F., O’Donnell-Luria, A., Poterba, T., Seed, C., Solomonson, M., Alföldi, J., The Genome Aggregation Database Production Team, The Genome Aggregation Database Consortium, Daly, M. J., & MacArthur, D. G.
A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020) by Collins, R. L., Brand, H., Karczewski, K. J., Zhao, X., Alföldi, J., Francioli, L. C., Khera, A. V., Lowther, C., Gauthier, L. D., Wang, H., Watts, N. A., Solomonson, M., O’Donnell-Luria, A., Baumann, A., Munshi, R., Walker, M., Whelan, C., Huang, Y., Brookings, T., ... Talkowski, M. E.
Hail by Hail Team
Hail utilities for gnomAD by gnomAD Production Team

See 15 usage examples →

The Singapore Nanopore Expression Data Set

bam, bioinformatics, fast5, fasta, fastq, genomic, life sciences, long read sequencing, short read sequencing, transcriptomics

The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...

Usage examples

Bambu: Transcript discovery and quantification using long read RNA-Seq data by Ying Chen, Andre Sim et al.
A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines. by Ying Chen et al.
JAFFAL: Detection of fusion genes from long read RNA-Seq data by Nadia M Davidson et al.
Accessing the SG-NEx dataset on AWS by Ying Chen
Differential RNA modification analysis tutorial with xPore by Yu Song Chuah

See 15 usage examples →

Distributed Archives for Neurophysiology Data Integration (DANDI)

biology, calcium imaging, cell imaging, electrophysiology, hdf5, life sciences, neuroimaging, neurophysiology, neuroscience, zarr

DANDI is a public archive of neurophysiology datasets, including raw and processed data, and associated software containers. Datasets are shared according to Creative Commons CC0 or CC-BY licenses. This US BRAIN Initiative supported archive provides a broad range of cellular neurophysiology data including intracellular and extracellular electrophysiology, optophysiology, calcium imaging, fiber photometry, behavioral time-series, and images from immunostaining experiments, from over 20 species. Data is organized using community standards: NWB - Neurodata Without Borders, BIDS - Brain Imaging Data Structure, NGFF - Next Generation File Format for...

Usage examples

Facilitating analysis of open neurophysiology data on the DANDI Archive using large language model tools by Magland JF, Ly R, Rübel O, Dichter B
A comparison of neuroelectrophysiology databases by Subash P, Gray A, Bhattacharyya B, et al.
ITK/VTK Viewer for OME-Zarr by Kitware
NWB Explorer by Open Source Brain
Neurosift: DANDI exploration and NWB visualization in the browser by Magland J, Soules J, Baker C, Dichter B

See 14 usage examples →

Fly Brain Anatomy: FlyLight Gen1 and Split-GAL4 Imagery

biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience

This data set, made available by Janelia's FlyLight project, consists of fluorescence images of Drosophila melanogaster driver lines, aligned to standard templates, and stored in formats suitable for rapid searching in the cloud. Additional data will be added as it is published.

Usage examples

Tutorial for neuronbridger (R API) by Alexander Bates
Fly Light Split-GAL4 Driver Collection by Rob Svirskas
Color Depth Search Fiji Plugin by Hideo Otsuna
File Operations on AWS S3 by Rob Svirskas
A GAL4-Driver Line Resource for Drosophila Neurobiology by Arnim Jenett, Gerald M Rubin, Teri-TB Ngo, David Shepherd, Christine Murphy, Heather Dionne, Barret D Pfeiffer, Amanda Cavallaro, Donald Hall, Jennifer Jeter, Nirmala Iyer, Dona Fetter, Joanna H Hausenfluck, Hanchuan Peng, Eric T Trautman, Robert R Svirskas, Eugene W Myers, Zbigniew R Iwinski, Yoshinori Aso, Gina M DePasquale, Adrianne Enos, Phuson Hulamm, Shing Chun Benny Lam, Hsing-Hsi Li, Todd R Laverty, Fuhui Long, Lei Qu, Sean D Murphy, Konrad Rokicki, Todd Safford, Kshiti Shaw, Julie H Simpson, Allison Sowell, Susana Tae, Yang Yu, Christopher T Zugates

See 13 usage examples →

RADIANT Public Data

cancer, genetic, genomic, Homo sapiens, life sciences, medical imaging, pediatric, radiology, transcriptomics, whole genome sequencing

The Real-time Analysis and Discovery in Integrated And Networked Technologies (RADIANT) initiative seeks to develop an extensible, federated framework for rapid exchange of multimodal clinical and research data on behalf of accelerated discovery and patient impact. Coordination and implementation of initial RADIANT deployments will leverage a network of more than 35 partnered health care systems and participating patient families within the Children’s Brain Tumor Network (CBTN) and the Pediatric Neuro-Oncology Consortium (PNOC). This data set is composed of public multi-modal data provisio...

Usage examples

Multiparametric MRI along with machine learning predicts prognosis and treatment response in pediatric low-grade glioma by Anahita Gathi Kazerooni, Adam Kraya, Komal S Rathi, Meen Chul Kim, et al.
Generation and multi-dimensional profiling of a childhood cancer cell line atlas defines new therapeutic opportunities by Claire Xin Sun, Paul Daniel, Gabrielle Bradshaw et al.
Use of External Control Cohorts in Pediatric Brain Tumor Clinical Trials by Ashley S Margol, Annette M Molinaro, Arzu Onar-Thomas, et al.
A road map for the treatment of pediatric diffuse midline glioma by Carl Koschmann, Wajd N Al-Holou, Marta M Alonso, et al.
CAVATICA by Seven Bridges Genomics

See 13 usage examples →

International Neuroimaging Data-Sharing Initiative (INDI)

Homo sapiens, imaging, life sciences, magnetic resonance imaging, neuroimaging, neuroscience

This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG). In addition to the raw data, preprocessed data is also included for some datasets. A complete list of the available datasets can be seen in the documentation link provided below.
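
For readers who want to see what is inside a bucket like this before downloading anything, the following is a minimal sketch of browsing a public S3 bucket with only the Python standard library. It assumes the bucket permits anonymous listing, which not every dataset in the registry does. The bucket name `example-open-data-bucket`, the `data/` prefix, and the sample response are hypothetical placeholders, not values from this registry; the real bucket name comes from the dataset's own documentation.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(bucket, prefix="", max_keys=100):
    """Build an unsigned ListObjectsV2 URL for a publicly listable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter(f"{S3_NS}Contents"):
        key = item.findtext(f"{S3_NS}Key")
        size = int(item.findtext(f"{S3_NS}Size"))
        entries.append((key, size))
    return entries


# Hypothetical response body, so the parser can be tried offline.
SAMPLE_RESPONSE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-open-data-bucket</Name>
  <Prefix>data/</Prefix>
  <KeyCount>2</KeyCount>
  <IsTruncated>false</IsTruncated>
  <Contents><Key>data/README.txt</Key><Size>1024</Size></Contents>
  <Contents><Key>data/sub-001/anat.nii.gz</Key><Size>2048</Size></Contents>
</ListBucketResult>"""

if __name__ == "__main__":
    # Placeholder bucket and prefix; substitute the documented values.
    print(build_list_url("example-open-data-bucket", prefix="data/"))
    for key, size in parse_listing(SAMPLE_RESPONSE):
        print(key, size)
```

To use it against a real bucket, fetch the URL from `build_list_url` (for instance with `urllib.request.urlopen`) and pass the response body to `parse_listing`. Large listings are paginated, so a fuller version would also need to follow the continuation token in the response.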

Usage examples

Configurable Pipeline for the Analysis of Connectomes (C-PAC) by INDI C-PAC Team (https://fcp-indi.github.io/)
The NKI-Rockland sample: a model for accelerating the pace of discovery science in psychiatry by K.B. Nooner, S.J. Colcombe, ..., M.P. Milham
Assessment of the impact of shared brain imaging data on the scientific literature by M.P. Milham, R.C. Craddock, ..., A. Klein
Downloading FCP-INDI Neuroimaging Data from Amazon S3 by INDI
Making data sharing work: The FCP/INDI experience by M. Mennes, B.B. Biswal, F.X. Castellanos, M.P. Milham

See 11 usage examples →

Open Targets

bioinformatics, biology, drug discovery, genetic, genomic, life sciences, protein

The Open Targets Platform is a comprehensive data integration tool that supports systematic identification and prioritisation of potential therapeutic drug targets. By integrating publicly available datasets including data generated by the Open Targets experimental and informatics research programmes, the Platform provides data and services to assist in the task of therapeutic hypothesis building.

Usage examples

See 11 usage examples →

The Cancer Dependency Map (DepMap) Cancer Cell Line Encyclopedia (CCLE) Dataset

bam, bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, short read sequencing, transcriptomics, whole exome sequencing, whole genome sequencing

This dataset consists of whole genome sequencing (WGS), whole exome sequencing (WES), and RNA sequencing files generated from ~1000 cancer cell lines described in Ghandi et al., 2019.

Usage examples

Bridging the gap between cancer cell line models and tumours using gene expression data by Noorbakhsh, Vazquez & McFarland
The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks by Ben Guebila, Wang, Lopes-Ramos et al.
The present and future of the Cancer Dependency Map by Arafeh, Shibue, Dempster et al.
Integrated cross-study datasets of genetic dependencies in cancer by Pacini, Dempster, Boyle et al.
Cancer Cell Line Encyclopedia (CCLE) by Ghandi, Huang, Jané-Valbuena et al.

See 11 usage examples →

Alliance of Genome Resources

bioinformatics, biology, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, fasta, gene expression, genetic, genome, genomic, Homo sapiens, life sciences, Mus musculus, protein, Rattus norvegicus, transcriptomics, vcf

The Alliance of Genome Resources is a consortium that integrates genomic, genetic, and molecular data from leading model organism databases including Drosophila melanogaster, Caenorhabditis elegans, Danio rerio (zebrafish), Mus musculus (mouse), Rattus norvegicus (rat), Saccharomyces cerevisiae (yeast), Xenopus laevis and Xenopus tropicalis (frogs), and human reference data. The Alliance provides comprehensive datasets including gene annotations, disease associations, expression data (bulk and single-cell RNA-Seq), protein and genetic interactions, orthology relationships, variants and alleles...

Usage examples

See 10 usage examples →

Garvan Institute Long Read Sequencing Benchmark Data

bioinformatics, genomic, life sciences, long read sequencing

The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Examples include: NA12878 (HG001) and NA24385 (HG002) sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells; and, UHR RNA (direct-RNA) on an ONT PromethION using the latest RNA004 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for ...

Usage examples

Streamlining remote nanopore data access with slow5curl by Wong, B., Ferguson, J.M., Do, J.Y. et al.
Flexible and efficient handling of nanopore sequencing signal data with slow5tools. by Samarakoon, H., Ferguson, J.M., Jenner, S.P. et al.
Fast nanopore sequencing data analysis with SLOW5. by Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al.
Accelerated nanopore basecalling with SLOW5 data format. by Samarakoon, H., Ferguson, J.M., Gamaarachchi H. et al.
Directly processing on an s3fs mount by Hasindu Gamaarachchi

See 10 usage examples →

IBL Neuropixels Brainwide Map on AWS

life sciences, Mus musculus, neurophysiology, neuroscience, open source software

Electrophysiological recordings of mouse brain activity acquired during a decision making task.

Usage examples

See 10 usage examples →

NHGRI AnVIL Project

biology, gene expression, genome, genomic, Homo sapiens, life sciences

The NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL) Project (https://anvilproject.org/) is the National Human Genome Research Institute's cloud-based platform for genomic data sharing and analysis. AnVIL hosts widely used human genome reference datasets generated through NHGRI-funded research. AnVIL on Open Data on AWS provides public access to open-access datasets available through AnVIL. The project is a collaborative effort involving NHGRI, the Broad Institute, Johns Hopkins University, the University of California Santa Cruz, Vanderbilt University Medical Center, Brigh...

Usage examples

The complete sequence and comparative analysis of ape sex chromosomes by Kateryna D. Makova, Brandon D. Pickett, Robert S. Harris, Gabrielle A. Hartley, Monika Cechova, Karol Pal, Sergey Nurk, DongAhn Yoo, Qiuhui Li, Prajna Hebbar, Barbara C. McGrath, Francesca Antonacci, Margaux Aubel, Arjun Biddanda, Matthew Borchers, Erich Bornberg-Bauer, Gerard G. Bouffard, Shelise Y. Brooks, Lucia Carbone, Laura Carrel, Andrew Carroll, Pi-Chuan Chang, Chen-Shan Chin, Daniel E. Cook, Sarah J. C. Craig, Luciana de Gennaro, Mark Diekhans, Amalia Dutra, Gage H. Garcia, Patrick G. S. Grady, Richard E. Green, Diana Haddad, Pille Hallast, William T. Harvey, Glenn Hickey, David A. Hillis, Savannah J. Hoyt, Hyeonsoo Jeong, Kaivan Kamali, Sergei L. Kosakovsky Pond, Troy M. LaPolice, Charles Lee, Alexandra P. Lewis, Yong-Hwee E. Loh, Patrick Masterson, Kelly M. McGarvey, Rajiv C. McCoy, Paul Medvedev, Karen H. Miga, Katherine M. Munson, Evgenia Pak, Benedict Paten, Brendan J. Pinto, Tamara Potapova, Arang Rhie, Joana L. Rocha, Fedor Ryabov, Oliver A. Ryder, Samuel Sacco, Kishwar Shafin, Valery A. Shepelev, Viviane Slon, Steven J. Solar, Jessica M. Storer, Peter H. Sudmant, Sweetalana, Alex Sweeten, Michael G. Tassia, Françoise Thibaud-Nissen, Mario Ventura, Melissa A. Wilson, Alice C. Young, Huiqing Zeng, Xinru Zhang, Zachary A. Szpiech, Christian D. Huber, Jennifer L. Gerton, Soojin V. Yi, Michael C. Schatz, Ivan A. Alexandrov, Sergey Koren, Rachel J. O’Neill, Evan E. Eichler, Adam M. Phillippy
The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update by The Galaxy Community
The complete sequence of a human Y chromosome by Arang Rhie, Sergey Nurk, Monika Cechova, Savannah J. Hoyt, Dylan J. Taylor, Nicolas Altemose, Paul W. Hook, Sergey Koren, Mikko Rautiainen, Ivan A. Alexandrov, Jamie Allen, Mobin Asri, Andrey V. Bzikadze, Nae-Chyun Chen, Chen-Shan Chin, Mark Diekhans, Paul Flicek, Giulio Formenti, Arkarachai Fungtammasan, Carlos Garcia Giron, Erik Garrison, Ariel Gershman, Jennifer L. Gerton, Patrick G. S. Grady, Andrea Guarracino, Leanne Haggerty, Reza Halabian, Nancy F. Hansen, Robert Harris, Gabrielle A. Hartley, William T. Harvey, Marina Haukness, Jakob Heinz, Thibaut Hourlier, Robert M. Hubley, Sarah E. Hunt, Stephen Hwang, Miten Jain, Rupesh K. Kesharwani, Alexandra P. Lewis, Heng Li, Glennis A. Logsdon, Julian K. Lucas, Wojciech Makalowski, Christopher Markovic, Fergal J. Martin, Ann M. Mc Cartney, Rajiv C. McCoy, Jennifer McDaniel, Brandy M. McNulty, Paul Medvedev, Alla Mikheenko, Katherine M. Munson, Terence D. Murphy, Hugh E. Olsen, Nathan D. Olson, Luis F. Paulin, David Porubsky, Tamara Potapova, Fedor Ryabov, Steven L. Salzberg, Michael E. G. Sauria, Fritz J. Sedlazeck, Kishwar Shafin, Valery A. Shepelev, Alaina Shumate, Jessica M. Storer, Likhitha Surapaneni, Angela M. Taravella Oill, Françoise Thibaud-Nissen, Winston Timp, Marta Tomaszkiewicz, Mitchell R. Vollger, Brian P. Walenz, Allison C. Watwood, Matthias H. Weissensteiner, Aaron M. Wenger, Melissa A. Wilson, Samantha Zarate, Yiming Zhu, Justin M. Zook, Evan E. Eichler, Rachel J. O’Neill, Michael C. Schatz, Karen H. Miga, Kateryna D. Makova, Adam M. Phillippy
Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References by Dylan J. Taylor, Jordan M. Eizenga, Qiuhui Li, Arun Das, Katharine M. Jenike, Eimear E. Kenny, Karen H. Miga, Jean Monlong, Rajiv C. McCoy, Benedict Paten, and Michael C. Schatz
A complete reference genome improves analysis of human genetic variation by Sergey Aganezov, Stephanie M. Yan, Daniela C. Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J. Taylor, Kishwar Shafin, Alaina Shumate, Chunlin Xiao, Justin Wagner, Jennifer McDaniel, Nathan D. Olson, Michael E. G. Sauria, Mitchell R. Vollger, Arang Rhie, Melissa Meredith, Skylar Martin, Joyce Lee, Sergey Koren, Jeffrey A. Rosenfeld, Benedict Paten, Ryan Layer, Chen-Shan Chin, Fritz J. Sedlazeck, Nancy F. Hansen, Danny E. Miller, Adam M. Phillippy, Karen H. Miga, Rajiv C. McCoy, Megan Y. Dennis, Justin M. Zook, Michael C. Schatz

See 13 usage examples →

Open NeuroData

array tomography, biology, electron microscopy, image processing, life sciences, light-sheet microscopy, magnetic resonance imaging, neuroimaging, neuroscience

This bucket contains multiple neuroimaging datasets (as Neuroglancer Precomputed Volumes) across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include segmentations and meshes.

Usage examples

CloudVolume by William Silversmith
Visualization using Neuroglancer by Benjamin Falk
The Open Connectome Project Data Cluster: Scalable Analysis and Vision for High-Throughput Neuroscience by R. Burns, W. G. Roncal, D. Kleissas, K. Lillaney, P. Manavalan, E. Perlman, D. R. Berger, D. D. Bock, K. Chung, L. Grosenick, N. Kasthuri, N. C. Weiler, K. Deisseroth, M. Kazhdan, J. Lichtman, R. C. Reid, S. J. Smith, A. S. Szalay, J. T. Vogelstein, and R. J. Vogelstein.
Igneous by William Silversmith
From cosmos to connectomes: The evolution of data-intensive science by R. Burns, J. T. Vogelstein, and A. S. Szalay

See 9 usage examples →

PubSeq - Public Sequence Resource

bambioinformaticsbiologycoronavirusCOVID-19fast5fastafastqgeneticgenomichealthjsonlife scienceslong read sequencingmedicineMERSmetadataopen source softwareRDFSARSSARS-CoV-2SPARQL

COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.

Usage examples

See 9 usage examples →

Steinegger Lab Datasets

bioinformaticslife sciencesmetagenomicsopen source softwareproteinprotein folding

The Steinegger Lab Dataset comprises biological databases and resources critical for protein sequence and structure analysis, developed to support ColabFold, MMseqs2, and Foldseek/Foldcomp, three high-performance computational tools widely used in bioinformatics. The MMseqs2 dataset serves as the backbone for our fast structure prediction tool, ColabFold, and includes UniRef30, BFD, and the ColabFold environmental databases. These datasets are specifically designed for the rapid generation of multiple sequence alignments (MSAs), which are essential for high-accuracy structure prediction. Beyond ...

Usage examples

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets by Steinegger M and Söding J
Foldseek User Guide by Mirdita M and Steinegger M
Fast and accurate protein structure search with Foldseek by van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al.
ColabFold User Guide by Mirdita M and Ovchinnikov S
ColabFold Google Colab Notebook by Ovchinnikov S, Mirdita M and Steinegger M

See 9 usage examples →

CHAMMI-75

biologycell imagingfluorescence imaginghigh-throughput imagingimaginglife sciencesmachine learningmicroscopy

Quantifying cell morphology using images and machine learning models has proven to be a powerful tool to study the response of cells to treatments. However, the models used to quantify cellular morphology are typically trained with a single microscopy imaging type and under controlled experimental conditions. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels), or because the target experimental conditions are out of distribution. We have created CHAMMI-75, a large-scale dat...

Usage examples

CHAMMI Benchmarking Source Code by Chau Pham
MorphEm Model (Trained on CHAMMI-75) by Vidit Agrawal, John Peters, Juan Caicedo
CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images by Vidit Agrawal, John Peters, Tyler N. Thompson, Mohammad Vali Sanian, Chau Pham, Nikita Moshkov, Arshad Kazi, Aditya Pillai, Jack Freeman, Byunguk Kang, Samouil L. Farhi, Ernest Fraenkel, Ron Stewart, Lassi Paavolainen, Bryan A. Plummer, Juan C. Caicedo
CHAMMI-75 Website by Vidit Agrawal, Juan Caicedo
Get To Know A Dataset: CHAMMI-75 by Vidit Agrawal, Juan Caicedo

See 8 usage examples →

Cancer Cell Line Encyclopedia (CCLE)

cancergeneticgenomicHomo sapienslife sciencesSTRIDEStranscriptomicswhole genome sequencing

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.

Usage examples

ISB CGC BigQuery tables by Institute for Systems Biology
Broad Institute Cancer Cell Line Encyclopedia by The Broad Institute of MIT & Harvard
The landscape of cancer cell line metabolism by Li, H. et al.
Next-generation characterization of the Cancer Cell Line Encyclopedia by Ghandi, M., Huang F. et al.
Genomic Data Commons by National Cancer Institute

See 8 usage examples →

Logan Unitigs and Contigs of the Sequence Read Archive (SRA) on AWS

fastageneticgenomiclife sciencesmetagenomicsSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing

This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve near...

Usage examples

See 8 usage examples →

NIH Roadmap Epigenomics

bioinformaticsbiologyepigenomicsgeneticgenomiclife sciences

The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The project has generated high-quality, genome-wide maps of several key histone modifications, chromatin accessibility, DNA methylation and mRNA expression across hundreds of human cell types and tissues. To see what data is available, please check the directory listing: https://roadmapepigenomics.s3.us-west-2.amazonaws.com/index.html.
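As a minimal sketch, the directory listing URL above can be split into the pieces that S3 tools usually ask for. The helper name `split_s3_url` is hypothetical, and the code assumes the URL follows the virtual-hosted style `<bucket>.s3.<region>.amazonaws.com` with a bucket name that contains no dots. It only parses the string and makes no network request.

```python
from urllib.parse import urlparse

# Directory listing URL quoted in the dataset description.
LISTING_URL = "https://roadmapepigenomics.s3.us-west-2.amazonaws.com/index.html"


def split_s3_url(url):
    """Split a virtual-hosted-style S3 URL into (bucket, region, key).

    Hypothetical helper: assumes the host looks like
    <bucket>.s3.<region>.amazonaws.com and the bucket name has no dots.
    """
    parsed = urlparse(url)
    host_parts = parsed.netloc.split(".")
    if (
        len(host_parts) != 5
        or host_parts[1] != "s3"
        or host_parts[3:] != ["amazonaws", "com"]
    ):
        raise ValueError(f"not a virtual-hosted-style S3 URL: {url}")
    bucket, region = host_parts[0], host_parts[2]
    key = parsed.path.lstrip("/")  # object key without the leading slash
    return bucket, region, key


bucket, region, key = split_s3_url(LISTING_URL)
s3_uri = f"s3://{bucket}/{key}"  # the same object written as an S3 URI
```

The resulting bucket name and region are what a client such as the AWS CLI or boto3 would need. Whether the bucket permits anonymous listing is not stated here, so check the dataset documentation before relying on it.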

Usage examples

Visualize Roadmap data with WashU Epigenome Browser by WashU Epigenome Browser
Human body epigenome maps reveal noncanonical DNA methylation variation by Matthew D. Schultz, Yupeng He, John W. Whitaker, Manoj Hariharan, Eran A. Mukamel, Danny Leung, Nisha Rajagopal, Joseph R. Nery, Mark A. Urich, Huaming Chen, Shin Lin, Yiing Lin, Inkyung Jung, Anthony D. Schmitt, Siddarth Selvaraj, Bing Ren, Terrence J. Sejnowski, Wei Wang & Joseph R. Ecker
Navigation of Roadmap data using Roadmap web portal by Anshul Kundaje Lab
Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. by Elizabeta Gjoneska, Andreas R. Pfenning, Hansruedi Mathys, Gerald Quon, Anshul Kundaje, Li-Huei Tsai & Manolis Kellis.
Integrative analysis of 111 reference human epigenomes by Roadmap Epigenomics Consortium, Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour et al., Ting Wang, Manolis Kellis

See 8 usage examples →

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

bioinformaticsbiologyenvironmentalepigenomicsgeneticgenomiclife sciences

The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.

Usage examples

Environmental Determinants of cardiovascular disease: lessons learned from air pollution by Al-Kindi SG, Brook RD, Biswal S, Rajagopalan S.
The NIEHS TaRGET II Consortium and environmental epigenomics by Wang, T., Pehrsson, E., Purushotham, D. et al.
The role of environmental exposures and the epigenome in health and disease. by Perera BPU, Faulk C, Svoboda LK, Goodrich JM, Dolinoy DC.
Epigenetic biomarkers and preterm birth by Park B, Khanam R, Vinayachandran V, et al.
Comparison of differential accessibility analysis strategies for ATAC-seq data by Gontarz P, Fu S, Xing X, Liu S, Miao B et al.

See 8 usage examples →

BossDB Open Neuroimagery Datasets

calcium imagingelectron microscopyimaginglife scienceslight-sheet microscopymagnetic resonance imagingneuroimagingneurosciencevolumetric imagingx-rayx-ray microtomographyx-ray tomography

This data ecosystem, Brain Observatory Storage Service & Database (BossDB), contains several neuro-imaging datasets across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include dense segmentation and meshes.

Usage examples

CloudVolume by Seung Lab
Get To Know A Dataset - BossDB by BossDB Team
intern: Integrated Toolkit for Extensible and Reproducible Neuroscience by Jordan K Matelsky, Luis Rodriguez, Daniel Xenes, Timothy Gion, Robert Hider Jr., Brock Wester, William Gray-Roncal
A Community-Developed Open-Source Computational Ecosystem for Big Neuro Data by J. T. Vogelstein, E. Perlman, B. Falk, A. Baden, W. Gray Roncal, V. Chandrashekhar, F. Collman, S. Seshamani, J. L. Patsolic, K. Lillaney, M. Kazhdan, R. Hider, D. Pryor, J. Matelsky, T. Gion, P. Manavalan, B. Wester, M. Chevillet, E. T. Trautman, K. Khairy, E. Bridgeford, D. M. Kleissas, D. J. Tward, A. K. Crow, B. Hsueh, M. A. Wright, M. I. Miller, S. J. Smith, R. J. Vogelstein, K. Deisseroth, and R. Burns
Data access and download by Jordan Matelsky

See 7 usage examples →

CIViC (Clinical Interpretation of Variants in Cancer)

cancergeneticgenomiclife sciencesvcf

Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...

Usage examples

See 7 usage examples →

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

cancergenomiclife sciencesSTRIDEStranscriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.

Usage examples

Proteomic analysis of colon and rectal carcinoma using standard and customized databases by Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC
Cancer Genomics Cloud by Seven Bridges
CPTAC Data Portal by National Cancer Institute
Proteomic Data Commons by National Cancer Institute
Genomic Data Commons by National Cancer Institute

See 7 usage examples →

IBL Neuropixels Reproducible Ephys Data on AWS

life sciencesMus musculusneurophysiologyneuroscienceopen source software

Electrophysiological recordings acquired using Neuropixels probes in different mice and labs, targeting the same brain locations (including posterior parietal cortex, hippocampus, and thalamus).

Usage examples

See 7 usage examples →

ICGC on AWS

bamcancergeneticgenomiclife sciencesvcf

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Usage examples

See 7 usage examples →

SnpEff & SnpSift Genomic Variant Annotation Databases

bioinformaticscancergeneticgenomegenomiclife sciencesproteinstructural variationtranscriptomicsvariant annotationvcfwhole exome sequencingwhole genome sequencing

SnpEff is a variant annotation and effect prediction tool that annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes). It supports over 38,000 genomes and provides comprehensive genomic databases for variant annotation. The databases include reference genomes, gene annotations, protein sequences, and regulatory elements from trusted sources like ENSEMBL, RefSeq, and UCSC. SnpSift complements SnpEff by providing tools to annotate genomic variants using databases, filter large genomic datasets, and manipulate annotated variants. Together, these ...

Usage examples

See 7 usage examples →

Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)

cancergenomiclife sciencesSTRIDEStranscriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.

Usage examples

Genomic Data Commons by National Cancer Institute
Cancer Genomics Cloud by Seven Bridges
Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma by Clark DJ, Dhanasekaran SM, Petralia F, Pan J, Song X, Hu Y, da Veiga Leprevost F, Reva B, Lih TM, Chang HY, Ma W, Huang C, Ricketts CJ, Chen L, Krek A, Li Y, Rykunov D, Li QK, Chen LS, Ozbek U, Vasaikar S, Wu Y, Yoo S, Chowdhury S, Wyczalkowski MA, Ji J, Schnaubelt M, Kong A, Sethuraman S, Avtonomov DM, Ao M, Colaprico A, Cao S, Cho KC, Kalayci S, Ma S, Liu W, Ruggles K, Calinawan A, Gümüş ZH, Geizler D, Kawaler E, Teo GC, Wen B, Zhang Y, Keegan S, Li K, Chen F, Edwards N, Pierorazio PM, Chen XS, Pavlovich CP, Hakimi AA, Brominski G, Hsieh JJ, Antczak A, Omelchenko T, Lubinski J, Wiznerowicz M, Linehan WM, Kinsinger CR, Thiagarajan M, Boja ES, Mesri M, Hiltke T, Robles AI, Rodriguez H, Qian J, Fenyö D, Zhang B, Ding L, Schadt E, Chinnaiyan AM, Zhang Z, Omenn GS, Cieslik M, Chan DW, Nesvizhskii AI, Wang P, Zhang H; Clinical Proteomic Tumor Analysis Consortium
Proteomic Data Commons by National Cancer Institute
CPTAC Data Portal by National Cancer Institute

See 6 usage examples →

CryoET Data Portal

Biohubcell biologycryo electron tomographyelectron tomographylife sciencesmachine learningsegmentationstructural biology

Cryo-electron tomography (cryoET) is a powerful technique for visualizing 3D structures of cellular macromolecules at near atomic resolution in their native environment. Observing the inner workings of cells in context enables better understanding about the function of healthy cells and the changes associated with disease. However, the analysis of cryoET data remains a significant bottleneck, particularly the annotation of macromolecules within a set of tomograms, which often requires a laborious and time-consuming process of manual labelling that can take months to complete. Given the current...

Usage examples

See 6 usage examples →

ESM Atlas — Protein Features and Structures

Biohubbioinformaticslife sciencesmachine learningmetagenomicsproteinstructural biology

The ESM Atlas is a large-scale public dataset of computational outputs generated by ESMC and ESMFold2, derived from a deduplicated set of over 6.8 billion publicly available protein sequences spanning all domains of life, including viral proteins and previously unannotated sequences representing metagenomic dark matter sampled from a wide range of biomes. The dataset includes two primary components: sparse autoencoder (SAE) features for ~6.8 billion proteins, capturing interpretable biological representations from the ESMC 6B model, and predicted three-dimensional protein structures for ~1....

Usage examples

See 6 usage examples →

IBL Behavioral Data on AWS

life sciencesMus musculusneurophysiologyneuroscienceopen source software

Behavioral data from mice performing a decision-making task, associated with the IBL's 2020 publication.

Usage examples

See 6 usage examples →

NYU Langone & FAIR FastMRI Dataset

biologyhealthimage processingimaginglife sciencesmagnetic resonance imagingneurobiologyneuroimaging

This dataset contains deidentified raw k-space data and DICOM image files of over 1,500 knees and 6,970 brains.

Usage examples

See 6 usage examples →

Open Bioinformatics Reference Data for Galaxy

bioinformaticsbiologygeneticgenomiclife sciencesreference index

This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easily incorporable with a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...

Usage examples

Accessible, curated metagenomic data through ExperimentHub by Edoardo Pasolli, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd, Curtis Huttenhower, Martin Morgan, Nicola Segata, and Levi Waldron
Using Open Bio Ref Data with Galaxy and Bioconductor by Enis Afgan, Alexandru Mahmoud, Nuwan Goonasekera
Wrangling Galaxy's reference data by Daniel Blankenberg, James E. Johnson, The Galaxy Team, James Taylor, Anton Nekrutenko
TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages by Tiago C. Silva, Antonio Colaprico, Catharina Olsen, Fulvio D'Angelo, Gianluca Bontempi, Michele Ceccarelli, Houtan Noushmehr
Bioconductor by Bioconductor Project

See 6 usage examples →

Serratus: Ultra-deep Search for Novel Viruses - Versioned Data Release

bamCOVID-19geneticgenomiclife sciencesMERSSARSSARS-CoV-2virus

Serratus is a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses in response to the COVID-19 pandemic through re-analysis of publicly available genomic data. Our resulting vertebrate viral alignment data is explorable via the Serratus Explorer and directly accessible on Amazon S3.

Usage examples

Petabase-scale sequence alignment catalyses viral discovery by Edgar R., Taylor J., Lin V., et al (2021)
Serratus Explorer by Serratus Team
Tantalus: An R Package for exploration of Serratus data by Serratus Team
Diversification of mammalian deltaviruses by host shifting by Bergner L.M., Orton R.J., et al (2021)
Ribovirus classification by a polymerase barcode sequence by Babaian A., and Edgar R. (2021)

See 6 usage examples →

3000 Rice Genomes Project

agriculturefood securitygeneticgenomiclife sciences

The 3000 Rice Genomes Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.

Usage examples

Identification and Allele Combination Analysis of Rice Grain Shape-Related Genes by Genome-Wide Association Study by Meng B et al (2022)
RiceGalaxy by International Rice Research Institute
Rice Galaxy: an open resource for plant science by Juanillas V et al (2019)
Structural variants in 3000 rice genomes by Fuentes RR et al (2019)
Tracking the origin of two genetic components associated with transposable element bursts in domesticated rice by Chen J et al (2019)

See 5 usage examples →

Automated Segmentation of Intracellular Substructures in Electron Microscopy (ASEM) on AWS

biologycell biologycomputer visionelectron microscopyimaginglife sciencesmicroscopysegmentation

The Automated Segmentation of intracellular substructures in Electron Microscopy (ASEM) project provides deep learning models trained to segment structures in 3D images of cells acquired by Focused Ion Beam Scanning Electron Microscopy (FIB-SEM). Each model is trained to detect a single type of structure (mitochondria, endoplasmic reticulum, Golgi apparatus, nuclear pores, clathrin-coated pits) in cells prepared via chemical fixation (CF) or high-pressure freezing and freeze substitution (HPFS). You can use our open source pipeline to load a model and predict a class of sub-cellular structur...

Usage examples

TK Lab Data Explorer by Patrick Stock
Deep neural network automated segmentation of cellular structures in volume electron microscopy by Benjamin Gallusser, Giorgio Maltese, Giuseppe Di Caprio, Tegy John Vadakkan, Anwesha Sanyal, Elliott Somerville, Mihir Sahasrabudhe, Justin O’Connor, Martin Weigert, Tom Kirchhausen
ASEM Colab Notebook (Interactive Demo) by Patrick Stock
How to use models by kirchhausenlab
Data layout and how to view by kirchhausenlab

See 5 usage examples →

CAncer MEtastases in LYmph nOdes challeNge (CAMELYON) Dataset

cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences

This dataset contains all the data for the CAncer MEtastases in LYmph nOdes challeNge or CAMELYON. CAMELYON was the first challenge using whole-slide images in computational pathology and aimed to help pathologists identify breast cancer metastases in sentinel lymph nodes. Lymph node metastases are extremely important to find, as they indicate that the cancer is no longer localized and systemic treatment might be warranted. Searching for these metastases in H&E-stained tissue is difficult and time-consuming and AI algorithms can play a role in helping make this faster and more accura...

Usage examples

See 5 usage examples →

Caenorhabditis Natural Diversity Resource

bambioinformaticsbiologyCaenorhabditis elegansfastqgatk-svgenetic mapsgenomegenome wide association studygenomiclife sciencesshort read sequencingvariant annotationvcf

The Caenorhabditis Natural Diversity Resource (CaeNDR) is a data repository and analysis hub for wild strains of the selfing Caenorhabditis species C. elegans, C. briggsae, and C. tropicalis from around the world. It facilitates the discovery of genetic variation across all three species through genome-wide association mappings that correlate genotype with phenotype and identify the genetic variation underlying quantitative traits.

Usage examples

Data Releases - C. tropicalis by Erik Andersen
FAQ - AWS API by Erik Andersen
Data Releases - C. briggsae by Erik Andersen
CaeNDR, the Caenorhabditis Natural Diversity Resource by Crombie TA, McKeown R, Moya ND, Evans KS, Widmayer SJ, LaGrassa V, et al.
Data Releases - C. elegans by Erik Andersen

See 5 usage examples →

CoMMpass from the Multiple Myeloma Research Foundation

cancergeneticgenomiclife sciencesSTRIDESwhole genome sequencing

The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...

Usage examples

"Identification of Initiating Trunk Mutations and Distinct Molecular Subtypes: An Interim Analysis of the Mmrf Commpass Study" by Jonathan J Keats, PhD, Gil Speyer, Legendre Christophe, Christofferson Austin, Kristi Stephenson, BS, Ahmet Kurdoglu, Megan Russell, Aldrich Jessica, Cuyugan Lori, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, David Siegel, MD PhD, Ravi Vij, MD MBA, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD PhD, Robert M. Rifkin, Norma C Gutierrez, The MMRF CoMMpass Network, Jen Toups, Mary Derome, MS, Winnie Liang, PhD, Seunchan Kim, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Sagar Lonial, MD
"Interim Analysis of the Mmrf Commpass Trial: Identification of Novel Rearrangements Potentially Associated with Disease Initiation and Progression" by Sagar Lonial, MD, Venkata D Yellapantula, Winnie Liang, PhD, Ahmet Kurdoglu, BS, Jessica Aldrich, MSc, Christophe M. Legendre, MD, Kristi Stephenson, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Megan Russell, Austin Christofferson, Lori Cuyugan, Dan Rohrer, Alex Blanski, Meghan Hodges, Mmrf CoMMpass Network, Mary Derome, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, David Craig, PhD, John Carpten, PhD, Jonathan J. Keats, PhD
"Molecular Predictors of Outcome and Drug Response in Multiple Myeloma: An Interim Analysis of the Mmrf CoMMpass Study" by Jonathan J Keats, PhD, Gil Speyer, Austin Christofferson, Christophe Legendre, PhD, Jessica Aldrich, Megan Russell, Lori Cuyugan, Jonathan Adkins, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, Ravi Vij, MD, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD, Robert M Rifkin, Norma C Gutierrez, MD PhD, Mmrf CoMMpass Network, Jennifer Yesil, MS, Mary Derome, MS, Seungchan Kim, PhD, Winnie Liang, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Daniel Auclair, PhD, Sagar Lonial, MD FACP
Genomic Data Commons by National Cancer Institute
"Interim Analysis Of The MMRF CoMMpass Trial: a Longitudinal Study In Multiple Myeloma Relating Clinical Outcomes To Genomic and Immunophenotypic Profiles" by Keats JJ, Craig DW, Liang W, Venkata Y, Kurdoglu A, Aldrich J, Auclair D, Allen K, Harrison B, Jewell S, Kidd PG, Correll M, Jagannath S, Siegel DS, Vij R, Orloff G, Zimmerman TM, MMRF CoMMpass Network, Capone W, Carpten J, Lonial S.

See 5 usage examples →

EEGDash on AWS

deep learninglife sciencesmachine learningneuroimagingneuroscience

The EEG-DaSh (EEG Data Sharing) data archive is a large-scale data-sharing resource for magnetoencephalography and electroencephalography (MEEG) data hosted at the Swartz Center for Computational Neuroscience (SCCN), UC San Diego. It provides curated, BIDS-formatted datasets for neuroscience research, machine learning, and deep learning applications. The archive spans three S3 buckets: (1) the EEGDash bucket for data served through the EEGDash platform, (2) the NEMAR archive containing datasets contributed through the NEMAR (Neuroelectromagnetic Data Archive and Tools Resource) platform, which...

Usage examples

NEMAR: an open access data, tools, and compute resource operating on neuroelectromagnetic data by A. Delorme, D. Truong, C. Youn, T. Mullen, A. Smetanin, S. Bhatt, S. Makeig
eegdash on pypi.python.org - Python module to query and download EEGDash data from Amazon S3 by Young Truong
Deep Learning on EEGDash Data example by SCCN
NEMAR - Neuroelectromagnetic Data Archive and Tools Resource by SCCN
EEGLAB - An open source environment for electrophysiological signal processing by SCCN

See 5 usage examples →

IBL Neuropixels Brainwide Map on AWS

autism spectrum disorderlife sciencesMus musculusneurophysiologyneuroscienceopen source software

Electrophysiological recordings of mouse brain activity acquired during a decision-making task in multiple mouse models of autism.

Usage examples

See 5 usage examples →

MONKEY

cancerclassificationcomputational pathologycomputer visiondeep learningdigital pathologygrand-challenge.orghistopathologyimaginglife sciencesmachine learningmedical image computingmedical imaging

This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...

Usage examples

See 5 usage examples →

Meta-Organized Stimuli And fMRI Imaging data for Computational modeling (MOSAIC)

brain imagesbrain modelshdf5life sciencesmachine learningneuroimagingneuroscience

This extensible dataset, MOSAIC, aggregates individual functional magnetic resonance imaging (fMRI) datasets by leveraging a shared preprocessing pipeline and stimulus curation procedure. This dataset aggregation procedure achieves the scale necessary for neural network training and the diversity needed for generalizable results.

Usage examples

MOSAIC Python package (mosaic-dataset) by Mayukh Deb
Load HDF5 file (Jupyter notebook) by Benjamin Lahner
Download MOSAIC data, visualize fMRI responses, load and run brain-optimized models (Jupyter notebook) by Mayukh Deb
Run a synthetic localizer experiment using MOSAIC's brain-optimized models (Jupyter notebook) by Benjamin Lahner
Preprocess fMRI datasets with MOSAIC shared pipeline by Benjamin Lahner

See 5 usage examples →

NIH NCBI Sequence Read Archive (SRA) on AWS

bamcramfastqgeneticgenomiclife sciencesSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing

The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-rel...

Usage examples

See 5 usage examples →

OME-Zarr Open SciVis Datasets

biologycomputed tomographyimage processingimaginglife sciencesmagnetic resonance imagingneuroimagingneurosciencevolumetric imagingzarr

This project provides the Open SciVis Datasets in a chunked, highly-compressed, multi-scale format, encodes metadata in JSON according to the OME-Zarr specification, and hosts the datasets on AWS S3 through the AWS Open Data Program, aiming to serve as a web-based resource for the scientific visualization community to enhance reproducibility and facilitate testing and development of OME-Zarr tools.
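To illustrate how the multi-scale layout is described in JSON, the sketch below reads the resolution levels from a metadata block. The sample metadata is hypothetical and hand-written to resemble the `multiscales` entry of OME-Zarr version 0.4; it is not copied from any dataset in this bucket, and the helper name `resolution_levels` is likewise an assumption.

```python
import json

# Hypothetical metadata shaped like the "multiscales" block of OME-Zarr 0.4.
# Real datasets may differ in axes, number of levels, and scale values.
SAMPLE_ZATTRS = json.loads("""
{
  "multiscales": [{
    "version": "0.4",
    "axes": [
      {"name": "z", "type": "space"},
      {"name": "y", "type": "space"},
      {"name": "x", "type": "space"}
    ],
    "datasets": [
      {"path": "0",
       "coordinateTransformations": [{"type": "scale", "scale": [1.0, 1.0, 1.0]}]},
      {"path": "1",
       "coordinateTransformations": [{"type": "scale", "scale": [2.0, 2.0, 2.0]}]}
    ]
  }]
}
""")


def resolution_levels(zattrs):
    """Return [(array_path, {axis_name: scale})] for the first multiscale image."""
    multiscale = zattrs["multiscales"][0]
    axis_names = [axis["name"] for axis in multiscale["axes"]]
    levels = []
    for dataset in multiscale["datasets"]:
        # Each level carries a "scale" transformation with one factor per axis.
        scale = next(
            t["scale"]
            for t in dataset["coordinateTransformations"]
            if t["type"] == "scale"
        )
        levels.append((dataset["path"], dict(zip(axis_names, scale))))
    return levels


levels = resolution_levels(SAMPLE_ZATTRS)
```

Each entry pairs the path of one resolution level with its per-axis scale factors, which is the information a viewer needs to pick a level of detail before fetching any chunks.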

Usage examples

OME-Zarr: a cloud-optimized bioimaging file format with international community support by Josh Moore, Daniela Basurto-Lozada, Sébastien Besson, John Bogovic, Jordão Bragantini, Eva M. Brown, Jean-Marie Burel, Xavier Casas Moreno, Gustavo de Medeiros, Erin E. Diel, David Gault, Satrajit S. Ghosh, Ilan Gold, Yaroslav O. Halchenko, Matthew Hartley, Dave Horsfall, Mark S. Keller, Mark Kittisopikul, Gabor Kovacs, Aybüke Küpcü Yoldaş, Koji Kyoda, Albane le Tournoulx de la Villegeorges, Tong Li, Prisca Liberali, Dominik Lindner, Melissa Linkert, Joel Lüthi, Jeremy Maitin-Shepard, Trevor Manz, Luca Marconato, Matthew McCormick, Merlin Lange, Khaled Mohamed, William Moore, Nils Norlin, Wei Ouyang, Bugra Özdemir, Giovanni Palla, Constantin Pape, Lucas Pelkmans, Tobias Pietzsch, Stephan Preibisch, Martin Prete, Norman Rzepka, Sameeul Samee, Nicholas Schaub, Hythem Sidky, Ahmet Can Solak, David R. Stirling, Jonathan Striebel, Christian Tischer, Daniel Toloudis, Isaac Virshup, Petr Walczysko, Alan M. Watson, Erin Weisbart, Frances Wong, Kevin A. Yamauchi, Omer Bayraktar, Beth A. Cimini, Nils Gehlenborg, Muzlifah Haniffa, Nathan Hotaling, Shuichi Onami, Loic A. Royer, Stephan Saalfeld, Oliver Stegle, Fabian J. Theis & Jason R. Swedlow
Open SciVis Datasets by Pavol Klacansky
A list of tools and libraries with OME-Zarr support by NGFF community
OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies by Josh Moore, Chris Allan, Sébastien Besson, Jean-Marie Burel, Erin Diel, David Gault, Kevin Kozlowski, Dominik Lindner, Melissa Linkert, Trevor Manz, Will Moore, Constantin Pape, Christian Tischer & Jason R. Swedlow
Read and Visualize in Python by Matt McCormick

See 5 usage examples →

Protein Data Bank 3D Structural Biology Data

amino acid, archives, bioinformatics, biomolecular modeling, cell biology, chemical biology, COVID-19, electron microscopy, electron tomography, enzyme, life sciences, molecule, nuclear magnetic resonance, pharmaceutical, protein, protein template, SARS-CoV-2, structural biology, x-ray crystallography

The "Protein Data Bank (PDB) archive" was established in 1971 as the first open-access digital data archive in biology. It is a collection of three-dimensional (3D) atomic-level structures of biological macromolecules (i.e., proteins, DNA, and RNA) and their complexes with one another and various small-molecule ligands (e.g., US FDA approved drugs, enzyme co-factors). For each PDB entry (unique identifier: 1abc or PDB_0000001abc) multiple data files contain information about the 3D atomic coordinates, sequences of biological macromolecules, information about any small molecules/ligan...

Usage examples

Announcing the worldwide Protein Data Bank by Berman, H., Henrick, K. & Nakamura, H.
PDB 101 by RCSB PDB
Get to Know a Dataset: Protein Data Bank 3D Structural Biology Data by RCSB PDB
Protein Data Bank: the single global archive for 3D macromolecular structure data by wwPDB consortium
File Download Services by RCSB PDB

See 5 usage examples →

SPARC: Datasets bridging the body and the brain

bioinformatics, electrophysiology, life sciences, microscopy, neurophysiology, neuroscience

The SPARC Datasets comprise a collection of scientific data focused on bridging the body and the brain. The datasets focus on neural connectivity, organ innervation and detailed anatomical mapping of the peripheral nervous system. SPARC datasets distinguish themselves from other data resources through their multi-modal approach to scientific data, integrating molecular, imaging, time-series and other data types associated with the interaction between the peripheral nervous system and organs. SPARC data provides a unique integrated effort to develop next generation mapping of anatomical ...

Usage examples

OSPARC by Esra Neufeld
The SPARC DRC: Building a Resource for the Autonomic Nervous System Community by Osanlouy M, Bandrowski A, de Bono B, Brooks D, Cassara A, Christie R, Ebrahimi N, Gillespie T, Grethe J, Guercio L, Heal M, Lin M, Kuster N, Martone M, Neufeld E, Nickerson D, Soltani E, Tappan S, Wagenaar J, Zhuang K, Hunter P
Downloading large scale SPARC datasets by The SPARC Data and Resource Center
Download public data, scaffolds and run computations by The SPARC Data and Resource Center
The Pennsieve Data Management Platform by Joost Wagenaar

See 8 usage examples →

The Human Connectome Project

biology, imaging, life sciences, neurobiology, neuroimaging, neuroscience

The Human Connectome Project (HCP Young Adult, HCP-YA) is mapping the healthy human connectome by collecting and freely distributing neuroimaging and behavioral data on 1,200 normal young adults, aged 22-35.

Usage examples

The Human Connectome Project: A retrospective by Elam JS, Glasser MF, Harms MP, Sotiropoulos SN, Andersson JL, Burgess GC, Curtiss SW, et al.
The minimal preprocessing pipelines for the Human Connectome Project by Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, Xu J, Jbabdi S, et al.
The WU-Minn Human Connectome Project: an overview. by Van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil, K, and the WU-Minn HCP Consortium.
The Human Connectome Workbench by The Human Connectome Project
Exploring the Human Connectome by The Human Connectome Project

See 5 usage examples →

APEX-CONNECTS

analysis ready data, brain images, brain models, imaging, infrastructure, json, life sciences, machine learning, metadata, microscopy, neuroimaging, neuroscience, nifti, zarr

The BRAIN Initiative Connectivity Across Scales (CONNECTS) program is working to create detailed maps of brain wiring across different species and scales, using advanced imaging technologies. APEX supports this effort by serving as a central hub that brings together and coordinates data and tools from research focused on brain connectivity in humans and animals. Together, these efforts aim to improve our understanding of how the brain is structured and functions.

Usage examples

See 4 usage examples →

BUSCO Datasets

assembly, bacteria, bioinformatics, genomic, life sciences, metagenomics, open source software, protein, virus

Lineage datasets for use with the BUSCO software package. Each dataset contains HMM profiles for clade-specific, universal, single-copy marker genes. Datasets are available across the archaea, bacteria, eukaryota and virus domains. The repository also includes the data files needed for phylogenetic placement of an input assembly.

Usage examples

OrthoDB and BUSCO update - annotation of orthologs with wider sampling of genomes. by Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva
BUSCO - assessing genomic data quality and beyond. by Mosè Manni, Matthew R. Berkeley, Mathieu Seppey, Evgeny M. Zdobnov
BUSCO Update - Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. by Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, Evgeny M Zdobnov
BUSCO - from QC to gene prediction and phylogenomics by Matthew Berkeley

See 4 usage examples →

Basic Local Alignment Search Tool (BLAST) Databases

bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, STRIDES, transcriptomics

A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).

Usage examples

BLAST+ Docker by NCBI BLAST
Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman
BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden
BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang

See 4 usage examples →

Encyclopedia of DNA Elements (ENCODE)

bioinformatics, biology, genetic, genomic, life sciences

The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a ...

Usage examples

See 4 usage examples →

Epigenomes of the Human Pangenome Reference Consortium (HPRC) Release 2

bioinformatics, biology, epigenomics, genetic, genomic, life sciences

The Human Pangenome Reference Consortium (HPRC) Release 2 represents a landmark achievement in genomics, providing high-quality phased genome assemblies from over 200 individuals with comprehensive functional genomics data. The HPRC Epigenome Browser provides researchers a way to explore all epigenomics data generated by release 2. The HPRC Epigenome Browser (HPRCEB) is a modern, interactive web portal that democratizes access to HPRC Release 2 epigenomics data through an intuitive interface supporting genome selection, data visualization, and bulk download capabilities. The portal integrates ...

Usage examples

WashU Epigenome Browser update 2025 by Chanrung Seng, Shane Liu, Wenjin Zhang, Xiaoyu Zhuo, Daofeng Li, Ting Wang
"Modbed track: Visualization of modified bases in single-molecule sequencing" by Daofeng Li, Xiaoyu Zhuo, Jessica K. Harrison, Shane Liu, Ting Wang
A draft human pangenome reference by Liao, WW., Asri, M., Ebler, J. et al.
"Get To Know A Dataset: HPRC Epigenome" by HPRC Epigenome Browser

See 4 usage examples →

Epilepsy.Science

bioinformatics, electrophysiology, life sciences, medicine, neuroscience

Epilepsy.Science comprises a set of datasets focused on epilepsy research that span both clinical and pre-clinical data. Datasets are contributed by the epilepsy research community and published using a standardized structure and metadata. Clinical datasets include de-identified subject information, EEG, and clinical imaging.

Usage examples

The Epilepsy.Science Portal by Joost Wagenaar, Brandon Westover, Kathryn Davis, Nishant Sinha, Brian Litt
Submitting a dataset proposal by Pennsieve
The Pennsieve Data Management Platform by Joost Wagenaar
Pennsieve Open Repositories by Pennsieve

See 4 usage examples →

Genome in a Bottle on AWS

genetic, genomic, life sciences, reference index, vcf

Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up-to-date GIAB release.

Usage examples

GA4GH Benchmarking Tools by GA4GH Benchmarking Team
The Genome in a Bottle Github Project by Genome In A Bottle Consortium
Extensive sequencing of seven human genomes to characterize benchmark reference materials by Zook J et al (2016)
High-coverage, long-read sequencing of Han Chinese trio reference samples by Wang Y et al (2019)

See 4 usage examples →

International Skin Imaging Collaboration (ISIC) Archive

biology, cancer, classification, computational pathology, dicom, grand-challenge.org, health, Homo sapiens, imaging, life sciences, machine learning, medical image computing, medical imaging, medicine, microscopy, segmentation

A public-access archive of skin lesion images, supporting teaching, research, and the development and evaluation of diagnostic algorithms.

Usage examples

isic-cli - The official command line tool for interacting with the ISIC Archive by International Skin Imaging Collaboration (ISIC)
Human surface anatomy terminology for dermatology: a Delphi consensus from the International Skin Imaging Collaboration by Navarrete-Dechent C, Liopyris K, Molenda M, Braun R, Curiel-Lewandrowski C, Dusza S, et al
A patient-centric dataset of images and metadata for identifying melanomas using clinical context by Rotemberg V, Kurtansky N, Betz-Stablein B, Caffery L, Chousakos E, Codella N, et al
The SLICE-3D dataset: 400,000 skin lesion image crops extracted from 3D TBP for skin cancer detection by Kurtansky N, D'Alessandro B, Gillis M, Betz-Stablein B, Cerminara S, Garcia R, et al
International Skin Imaging Collaboration - Designated Diagnoses (ISIC-DX): Consensus terminology for lesion diagnostic labeling by Scope A, Liopyris K, Weber J, Barnhill R, Braun R, Curiel-Lewandrowski C, et al

See 7 usage examples →

Molecular Profiling to Predict Response to Treatment (phs001965)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Molecular Profiling to Predict Response to Treatment (MP2PRT) program is part of the NCI's Cancer Moonshot Initiative. The aim of this program is the retrospective characterization and analysis of biospecimens collected from completed NCI-sponsored trials of the National Clinical Trials Network and the NCI Community Oncology Research Program. This study, titled "Identification of Genetic Changes Associated with Relapse and/or Adaptive Resistance in Patients Registered as Favorable Histology Wilms Tumor on AREN03B2", performs genomic characterization (WGS 30X, Total RNAseq, mi...

Usage examples

Genetic changes associated with relapse in favorable histology Wilms tumor: A Children's Oncology Group AREN03B2 study by Samantha Gadd, Vicki Huff, et al.
Genomic Data Commons by National Cancer Institute
Finding the way to Wilms tumor by comparing the primary and relapse tumor samples by Filippo Spreafico, Sara Ciceri, et al.
Childhood Cancer Data Initiative Data Catalog by National Cancer Institute

See 4 usage examples →

Mouse Brain Anatomy: MouseLight Imagery

biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience

This data set, made available by Janelia's MouseLight project, consists of images and neuron annotations of the Mus musculus brain, stored in formats suitable for viewing and annotation using the HortaCloud cloud-based annotation system.

Usage examples

MouseLight NeuronBrowser by Tiago A. Ferreira, Jayaram Chandrashekar
HortaCloud by David Schauder, Donald J. Olbris, Jody Clements, Cristian Goina, Robert R. Svirskas, Konrad Rokicki
MouseLight Project Website by Tiago A. Ferreira, Jayaram Chandrashekar
Reconstruction of 1,000 Projection Neurons Reveals New Cell Types and Organization of Long-Range Connectivity in the Mouse Brain by Johan Winnubst, Erhan Bas, Tiago A. Ferreira, Zhuhao Wu, Michael N. Economo, Patrick Edson, Ben J. Arthur, Christopher Bruns, Konrad Rokicki, David Schauder, Donald J. Olbris, Sean D. Murphy, David G. Ackerman, Cameron Arshadi, Perry Baldwin, Regina Blake, Ahmad Elsayed, Mashtura Hasan, Daniel Ramirez, Bruno Dos Santos, Monet Weldon, Amina Zafar, Joshua T. Dudman, Charles R. Gerfen, Adam W. Hantman, Wyatt Korff, Scott M. Sternson, Nelson Spruston, Karel Svoboda, Jayaram Chandrashekar

See 4 usage examples →

OpenCell on AWS

Biohub, biology, cell biology, cell imaging, computer vision, fluorescence imaging, imaging, life sciences, machine learning, microscopy

The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.

Usage examples

See 4 usage examples →

OpenFold3 Training Data

life sciences, msa, open source software, openfold, protein, protein folding, protein template

This dataset contains MSAs and predicted structures used to train OpenFold3 preview, an open-source, all-atom ligand, RNA and protein structure prediction software. This includes -

PDB - 245k structures and alignments from the RCSB Protein Data Bank - https://www.rcsb.org/
Long monomer distillation set - ~13 million long (sequence length >= 200 amino acids) monomers from the MGNIFY database - https://www.ebi.ac.uk/metagenomics/.
Short monomer distillation set - 400k short (sequence length < 200 amino acids) monomers from the MGNIFY database - https://www.ebi.ac.uk/metagenomics/.
Disordered

...

Usage examples

Looking at an OpenFold3 MSA in a Browser-Based Notebook on Scigantic by Scigantic
Deploying OpenFold3 with NVIDIA NIMs on Brev & AWS EC2 by Glòria Macià
OpenFold3-preview2 Technical Report by The OpenFold3 Team
OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J, et al

See 4 usage examples →

Refgenie reference genome assets

bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing

Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.

Usage examples

See 4 usage examples →

Synthea synthetic patient generator data in OMOP Common Data Model

bioinformatics, health, life sciences, natural language processing, us

The Synthea generated data is provided here as 1,000 person (1k), 100,000 person (100k), and 2,800,000 person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and gov...

Usage examples

Create data science environments on AWS for health analysis using OHDSI by James Wiggins
Predict patient health outcomes using OHDSI and machine learning on AWS by James Wiggins
Map clinical notes to the OMOP Common Data Model and healthcare ontologies using Amazon Comprehend Medical by James Wiggins
OHDSIonAWS by James Wiggins

See 4 usage examples →

The Impact of Genomic Variation on Function Consortium (IGVF)

bioinformatics, biology, genetic, genomic, life sciences

The IGVF (Impact of Genomic Variation on Function) Consortium aims to understand how genomic variation affects genome function, which in turn impacts phenotype. The NHGRI is funding this collaborative program that brings together teams of investigators who will use state-of-the-art experimental and computational approaches to model, predict, characterize and map genome function, how genome function shapes phenotype, and how these processes are affected by genomic variation. These joint efforts will produce a catalog of the impact of genomic variants on genome function and phenotypes.
The Da...

Usage examples

See 4 usage examples →

UK Biobank Linkage Disequilibrium Matrices

genetic, genome wide association study, genomic, life sciences, population genetics

Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.

Usage examples

PolyFun and PolyPred software by Omer Weissbrod
Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores by Weissbrod et al.
Functionally informed fine-mapping and polygenic localization of complex trait heritability by Weissbrod et al.
PolyFun Wiki by Omer Weissbrod

See 4 usage examples →

UK Biobank Pan-Ancestry Summary Statistics

genetic, genome wide association study, genomic, life sciences, population genetics

A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.
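As a rough illustration of what a leave-one-population-out meta-analysis computes, the Python sketch below combines per-population effect estimates with fixed-effect inverse-variance weighting. The weighting scheme is an assumption made for this example (the description above says only "standard meta-analysis"), and the population labels and numbers are invented.

```python
import math


def ivw_meta(estimates):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    estimates maps a population label to (beta, standard_error).
    Returns the combined (beta, standard_error).
    """
    weights = {pop: 1.0 / se ** 2 for pop, (_, se) in estimates.items()}
    total = sum(weights.values())
    beta = sum(weights[pop] * b for pop, (b, _) in estimates.items()) / total
    return beta, math.sqrt(1.0 / total)


def leave_one_out(estimates):
    """Repeat the meta-analysis once per population, omitting that population."""
    return {
        left_out: ivw_meta({p: v for p, v in estimates.items() if p != left_out})
        for left_out in estimates
    }


# Invented per-population effect sizes and standard errors for one variant.
per_population = {
    "POP_A": (0.10, 0.02),
    "POP_B": (0.14, 0.04),
    "POP_C": (0.06, 0.05),
}

print("all populations:", ivw_meta(per_population))
for left_out, result in leave_one_out(per_population).items():
    print("without", left_out, "->", result)
```

Comparing each leave-one-out result with the all-population estimate shows how much a single population drives the combined signal.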

Usage examples

Hail by Hail Team
Pan-ancestry genetic analysis of the UK Biobank by Pan UKBB Team
Hail on AWS Quick Start by Amazon Web Services and PrivoIT
Hail Tutorials by Hail Team

See 4 usage examples →

1000 Genomes

fastq, genetic, genomic, life sciences, whole genome sequencing

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.

Usage examples

See 3 usage examples →

AI3 Protein-Ligand Binding Affinity Dataset

health, life sciences, machine learning, molecular dynamics, pharmaceutical, protein, simulations

The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications. Generating experimental data on a large scale is an expensive and arduous process. In domains such as medical diagnostics where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. We, teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine, have performed physics-based calculations (mo...

Usage examples

See 3 usage examples →

AdaptiveFlow Ligand Libraries

bioinformatics, life sciences, medicine, pharmaceutical, structural biology

AdaptiveFlow Versions of Ligand Libraries in Ready-To-Dock Format

Usage examples

See 3 usage examples →

Allen Ivy Glioblastoma Atlas

biology, cancer, computer vision, gene expression, genetic, glioblastoma, Homo sapiens, image processing, imaging, life sciences, machine learning, neurobiology

This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.

Usage examples

See 3 usage examples →

Allen Mouse Brain Atlas

biology, gene expression, genetic, image processing, imaging, life sciences, Mus musculus, neurobiology, transcriptomics

The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across >20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an...

Usage examples

See 3 usage examples →

Beat Acute Myeloid Leukemia (AML) 1.0

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES

Beat AML 1.0 is a collaborative research program involving 11 academic medical centers that worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples, offering genomic, clinical, and drug response data. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...

Usage examples

Genomic Data Commons by National Cancer Institute
Functional Genomic Landscape of Acute Myeloid Leukemia by Jeffrey W. Tyner, Cristina E. Tognon, Dan Bottomly et al.
Clinical resistance to crenolanib in acute myeloid leukemia due to diverse molecular mechanisms by Zhang H, Savage S, Schultz AR, Bottomly D, White L, Segerdell E, et al.

See 3 usage examples →

BraiDyn-BC: Cued lever-pull task dataset

calcium imaging, imaging, life sciences, Mus musculus, neuroscience, video

The BraiDyn-BC (Brain Dynamics underlying emergence of Behavioral Change) Database offers an extensive, multimodal dataset that links wide-field calcium imaging of the mouse neocortex to comprehensive behavioral measurements during a behavioral task. As part of this database, we provide a new dataset that includes 15 sessions spanning two weeks of motor skill learning, in which 25 mice were trained to pull a lever to obtain water rewards. Simultaneous high-speed videography captures body, facial, and eye movements, and environmental parameters are monitored. The dataset also ...

Usage examples

A set of libraries used for generating the dataset by Keisuke Sehara, Ryo Aoki, Shoya Sugimoto
A multimodal dataset linking wide-field calcium imaging to behavior changes in mice during an operant lever-pull task by Kondo M, Sehara K, Harukuni R, Aoki R, Sugimoto S, Tanaka YR, Matsuzaki M, Nakae K
Detailed usage tutorials on Google Colab by Keisuke Sehara

See 3 usage examples →

COBRA

cancer, computational pathology, computer vision, deep learning, histopathology, life sciences

This page describes the COBRA (Classification Of Basal cell carcinoma, Risky skin cancers and Abnormalities) skin pathology dataset, which comprises over 7,000 histopathology whole-slide images related to the diagnosis of basal cell carcinoma skin cancer, the most commonly diagnosed cancer. The dataset includes biopsies and excisions and is divided into four groups. The first group contains about 2,500 BCC biopsies with subtype labels, while the second group includes 2,500 non-BCC biopsies with different types of skin dysplasia. The third group has 1,000 labelled risky cancer biopsies, includin...

Usage examples

See 3 usage examples →

COVID-19 Harmonized Data

coronavirus, COVID-19, life sciences

A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis

Usage examples

See 3 usage examples →

Cell Organelle Segmentation in Electron Microscopy (COSEM) on AWS

cell biology, computer vision, electron microscopy, imaging, life sciences, organelle

High resolution images of subcellular structures.

Usage examples

Whole-cell organelle segmentation in volume electron microscopy by Lisa Heinrich, Davis Bennett, David Ackerman, Woohyun Park, Jon Bogovic, Nils Eckstein, et al.
Correlative three-dimensional super-resolution and block-face electron microscopy of whole vitreously frozen cells. by David P. Hoffman, Gleb Shtengel, C. Shan Xu, Kirby R. Campbell, Melanie Freeman, Lei Wang, Daniel E. Milkie, H. Amalia Pasolli, Nirmala Iyer, John A. Bogovic, Daniel R. Stabley, Abbas Shirinifard, Song Pang, David Peale, Kathy Schaefer, Wim Pomp, Chi-Lun Chang, Jennifer Lippincott-Schwartz, Tom Kirchhausen, David J. Solecki, Eric Betzig, Harald F. Hess
Enhanced FIB-SEM systems for large-volume 3D imaging by C. Shan Xu, Kenneth J. Hayworth, Zhiyuan Lu, Patricia Grob, Ahmed M. Hassan, José G. García-Cerdán, Krishna K. Niyogi, Eva Nogales, Richard J. Weinberg, Harald F. Hess.

See 3 usage examples →

CitrusFarm Dataset

agriculture, computer vision, IMU, lidar, life sciences, localization, mapping, robotics

CitrusFarm is a multimodal agricultural robotics dataset that provides both multispectral images and navigational sensor data for localization, mapping and crop monitoring tasks.

It was collected by a wheeled mobile robot in the Agricultural Experimental Station at the University of California Riverside in the summer of 2023.
It offers a total of nine sensing modalities, including stereo RGB, depth, monochrome, near-infrared and thermal images, as well as wheel odometry, LiDAR, IMU and GPS-RTK data.
It comprises seven sequences collected from three citrus tree fields, featuring various tree spe

...

Usage examples

Python scripts used in the data collection and post-processing by Hanzhe Teng et al.
Multimodal Dataset for Localization, Mapping and Crop Monitoring in Citrus Tree Farms by Hanzhe Teng, Yipeng Wang, Xiaoao Song and Konstantinos Karydis
Python script to download this dataset by Hanzhe Teng et al.

See 3 usage examples →

Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma

cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.

Usage examples

Genomic Data Commons by National Cancer Institute
Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., Calvin A. Johnson, Ph.D., James D. Phelan, Ph.D., James Q. Wang, Ph.D., Sandrine Roulland, Ph.D., Monica Kasbekar, Ph.D., Ryan M. Young, Ph.D., Arthur L. Shaffer, Ph.D., Daniel J. Hodson, M.D., Ph.D., Wenming Xiao, Ph.D., et al.
A multiprotein supercomplex controlling oncogenic signalling in lymphoma by Phelan JD, Young RM, Webster DE, Roulland S, Wright GW, Kasbekar M, Shaffer AL 3rd, Ceribelli M, Wang JQ, Schmitz R, Nakagawa M, Bachy E, Huang DW, Ji Y, Chen L, Yang Y, Zhao H, Yu X, Xu W, Palisoc MM, Valadez RR, Davies-Hill T, Wilson WH, Chan WC, Jaffe ES, Gascoyne RD, Campo E, Rosenwald A, Ott G, Delabie J, Rimsza LM, Rodriguez FJ, Estephan F, Holdhoff M, Kruhlak MJ, Hewitt SM, Thomas CJ, Pittaluga S, Oellerich T, Staudt LM

See 3 usage examples →

DeepDrug Protein Embeddings Bank (DPEB)

bioinformatics, life sciences, machine learning, protein, structural biology

DPEB is a multimodal database of human protein embeddings integrating four biologically complementary representations—AlphaFold2, BioEmbeddings, ESM-2, and ProtVec—designed for enhanced protein-protein interaction prediction and functional classification.

Usage examples

See 3 usage examples →

Exceptional Responders Initiative

cancer, epigenomics, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 mon...

Usage examples

Genomic Data Commons by National Cancer Institute
GDC Legacy Archive by National Cancer Institute
The Exceptional Responders Initiative: Feasibility of a National Cancer Institute Pilot Study by Barbara A. Conley, Lou Staudt, et al.

See 3 usage examples →

Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)

cancer, genomic, life sciences

The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

Usage examples

Genomic Data Commons by National Cancer Institute
High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis by Ryan J. Hartmaier, Lee A. Albacker, Juliann Chmielecki, Mark Bailey, Jie He, Michael E. Goldberg, Shakti Ramkissoon, James Suh, Julia A. Elvin, Samuel Chiacchia, Garrett M. Frampton, Jeffrey S. Ross, Vincent Miller, Philip J. Stephens and Doron Lipson
Targeted next-generation sequencing of advanced prostate cancer identifies potential therapeutic targets and disease heterogeneity. by Beltran H, Yelensky R, Frampton GM, Park K, Downing SR, MacDonald TY, Jarosz M, Lipson D, Tagawa ST, Nanus DM, Stephens PJ, Mosquera JM, Cronin MT, Rubin MA

See 3 usage examples →

Golden Retriever Lifetime Study: Whole genome genotyping of Golden Retrievers on Axiom HD Arrays

genome, genotyping, golden retriever lifetime study, life sciences, morris animal foundation

Morris Animal Foundation’s Golden Retriever Lifetime Study is a longitudinal, prospective study following 3,044 golden retrievers. The Study’s purpose is to identify the nutritional, environmental, lifestyle and genetic risk factors for cancer and other diseases. The Golden Oldie’s study enrolled an additional cohort of golden retrievers that had reached the age of 12 years or older and had not yet been diagnosed with a malignant cancer. This population can be used as a control group for conditions with high mortality at a younger age. This dataset contains the data for ~1.1 million genetic marke...

Usage examples

GRLS GWAS Tutorial by Tamer Mansour
Cohort profile: The Golden Retriever Lifetime Study (GRLS) by Julia Labadie, Brenna Swafford, Mara DePena, Kathy Tietje, Rodney Page, Janet Patterson-Kane
The Golden Retriever Lifetime Study: establishing an observational cohort study with translational relevance for human health by Michael K. Guy, Rodney L. Page, Wayne A. Jensen, Patricia N. Olson, J. David Haworth, Erin E. Searfoss, and Diane E. Brown

See 3 usage examples →

Human and Mammalian Brain Atlas

biology, gene expression, Homo sapiens, life sciences, Mus musculus, neurobiology, non-human primate, single-cell transcriptomics

Human and Mammalian Brain Atlas (HMBA) is a major atlas of the BRAIN Initiative Cell Atlas Network (BICAN) that proposes to establish a comprehensive, highly granular cell atlas in complete adult human, macaque, and marmoset brains that links brain structure, function and cellular architecture. Release artifacts have been made available in this OpenData bucket to enable utilization along with their paper publications by the neuroscience community.

Usage examples

See 3 usage examples →

I-CARE: International Cardiac Arrest REsearch consortium Electroencephalography Database

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalography (EEG) recordings from 1,020 comatose patients with a diagnosis of cardiac arrest who were admitted to an intensive care unit at seven academic hospitals in the U.S. and Europe. Patients were monitored with 18 bipolar EEG channels over hours to days for the diagnosis of seizures and for neurological prognostication. Long-term neurological function was determined using the Cerebral Performance Category scale.

Usage examples

The International Cardiac Arrest Research (I-CARE) Consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.
I-CARE: International Cardiac Arrest REsearch consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.
WFDB Software Package by Moody, G., Pollard, T., & Moody, B.

See 3 usage examples →

Imaging MIT Licensed data and models

biodiversity, Biohub, bioinformatics, biology, biomolecular modeling, brain images, cell biology, cell imaging, imaging, life sciences, machine learning, microscopy, model, protein, zarr

This dataset contains a diverse range of biological imaging data and models. The data is sourced and curated by a team of experts at Biohub and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

Quickstart Tutorial for CELL-Diff by Biohub
Documentation for CELL-Diff by Biohub
SubCell: Vision foundation models for microscopy capture single-cell biology by Ankit Gupta, Zoe Wefers, Konstantin Kahnert, Jan N Hansen, William D. Leineweber, Anthony Cesnik, Dan Lu, Ulrika Axelsson, Frederic Ballllosera Navarro, Theofanis Karaletsos, Emma Lundberg
Quickstart Tutorial for SubCell by Biohub
Documentation for SubCell by Biohub

See 6 usage examples →

Kraken2 NCBI RefSeq Complete V205 database on AWS

benchmark, bioinformatics, life sciences, metagenomics, microbiome

Database for use with Kraken2 (taxonomic annotation of metagenomic sequencing reads), including all NCBI RefSeq genomes available in release V205.

Usage examples

Kraken2 by Derrick Wood, Jennifer Lu and Ben Langmead
From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools by Robyn J. Wright, Andre M. Comeau and Morgan G.I. Langille
Using an Amazon Machine Image for analysing samples with Kraken2 by Robyn Wright

See 3 usage examples →

MIMIC-III (‘Medical Information Mart for Intensive Care’)

bioinformatics, health, life sciences, natural language processing, us

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework. The MIMIC-I...

Usage examples

MIMIC-code GitHub repository by Alistair Johnson
Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data by Ujjwal Ratan, Nihir Chadderwala, and Parminder Bhatia
Perform biomedical informatics without a database using MIMIC-III data and Amazon Athena by James Wiggins, Alistair Johnson

See 3 usage examples →

Medical Segmentation Decathlon

computed tomography, health, imaging, life sciences, magnetic resonance imaging, medicine, nifti, segmentation

With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validati...

Usage examples

MONAI: Getting Started by MONAI Development Team
Pytorch-Integrated MSD Data Loader by MONAI Development Team
A large annotated medical image dataset for the development and evaluation of segmentation algorithms by Simpson A. L., Antonelli M., Bakas S., Bilello M., Farahani K., van Ginneken B., et al.

See 3 usage examples →

NASA Space Biology Open Science Data Repository (OSDR)

bioinformatics, biology, GeneLab, genomic, imaging, life sciences, space biology

NASA’s Space Biology Open Science Data Repository (OSDR) introduces a one-stop site where users can explore and contribute a variety of NASA open science biological data. This site consolidates data from the Ames Life Sciences Data Archive (ALSDA) and GeneLab and includes information about the broader NASA Open Science and Open Data initiatives, all at one centralized location. Our mission is to maximize the utilization of the valuable biological research resources and enable new discoveries.

OSDR introduces access to data generated from spaceflight and space relevant experiments that explore ...

Usage examples

NASA GeneLab: interfaces for the exploration of space omics data by Daniel C Berrios, Jonathan Galazka, Kirill Grigorev, Samrawit Gebre, Sylvain V Costes
GeneLab: Omics database for spaceflight experiments by Shayoni Ray, Samrawit Gebre, Homer Fogle, Daniel C Berrios, Peter B Tran, Jonathan M Galazka, Sylvain V Costes
Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration by Ryan T. Scott, Kirill Grigorev, Graham Mackintosh, Samrawit G. Gebre, Christopher E. Mason, Martha E. Del Alto, Sylvain V. Costes

See 3 usage examples →

National Cancer Institute Imaging Data Commons (IDC) Collections

cancer, digital pathology, fluorescence imaging, image processing, imaging, life sciences, machine learning, medical imaging, microscopy, radiology

Imaging Data Commons (IDC) is a repository within the Cancer Research Data Commons (CRDC) that manages imaging data and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections contributed either by funded US National Cancer Institute (NCI) data collection activities or by individual researchers. Image data hosted by IDC is stored in DICOM format.

Usage examples

See 3 usage examples →

ONT Methylation Benchmarking Datasets

bam, benchmark, bioinformatics, epigenomics, genomic, life sciences, long read sequencing

ONT Methylation Benchmarking Datasets are generated to benchmark existing methylation-calling tools on the Oxford Nanopore sequencing platform using their recent R10.4.1 flowcell chemistry. It spans a diverse range of species, including bacteria (E. coli, H. pylori J99, H. pylori 26695, A. variabilis, T. denticola), plants (Rice, Arabidopsis), and mammals (mouse, human). In addition, the dataset includes EMSeq data for E. coli, plant, and mouse samples, which can serve as ground truth for methylation studies. It also provides unmethylated whole-genome amplified (WGA) DNA for H. pylori 26695 and...

Usage examples

Methylation calling using ONT methylation benchmarking dataset by Onkar Kulkarni
Running Benchmarking Pipeline (Nextflow/Snakemake) on an Example Dataset using AWS by Onkar Kulkarni
Comprehensive benchmarking of tools for nanopore-based detection of DNA methylation by Kulkarni et al.

See 3 usage examples →

Open Human Genome Library

bioinformatics, biology, genomic, life sciences

The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.

Usage examples

AGC: compact representation of assembled genomes with fast queries and updates by Sebastian Deorowicz, Agnieszka Danek, Heng Li
BWT construction and search at the terabase scale by Heng Li
Using OpenHGL data by Heng Li

See 3 usage examples →

OpenProteinSet

alphafold, life sciences, msa, open source software, openfold, protein, protein folding, protein template

Multiple sequence alignments (MSAs) for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. Template hits are also provided for the PDB chains and 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth. MSAs were generated with HHBlits (-n3) and JackHMMER against MGnify, BFD, UniRef90, and UniClust30 while templates were identified from PDB70 with HHSearch, all according to procedures outlined in the supplement to the AlphaFold 2 Nature paper, Jumper et al. 2021. We expect the database to be broadly useful to structural biologists training or valid...

Usage examples

OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J, et al
OpenProteinSet: Training data for structural biology at scale by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian, et al
Run inference at scale for OpenFold, a PyTorch-based protein folding ML model, using Amazon EKS by Shubha Kumbadakone, Ankur Srivastava, and Sachin Kadyan

See 3 usage examples →

OpenRoboCare Multi-Modal Expert Demonstration Dataset for Robot-Assisted Caregiving

computer vision, health, life sciences, machine learning, robotics

A comprehensive multimodal dataset capturing real-world caregiving routines from 21 occupational therapists performing 15 daily caregiving tasks. The dataset includes synchronized RGB-D video, tactile sensing, eye-gaze tracking, pose annotations, and action labels across 315 sessions totaling 19.8 hours of expert demonstrations. Data modalities include anonymized RGB images, depth maps, 44-sensor tactile readings, 2D/3D pose tracking, temporal action annotations, and first/third-person videos, enabling research in robot learning from demonstration, multimodal perception, and safe human-robot i...

Usage examples

OpenRoboCare Dataset Viewer by Cornell University EmPRISE Lab
OpenRoboCare: A Multimodal Multi-Task Expert Demonstration Dataset for Robot Caregiving by Liang X, Liu Z, Lin K, et al.
Get To Know A Dataset: OpenRoboCare by Cornell University EmPRISE Lab

See 3 usage examples →

ProteinGym

bioinformatics, biology, deep learning, life sciences, machine learning, protein

ProteinGym is a benchmark suite for assessing the performance of protein fitness prediction and design models. It comprises a large curated collection of 200+ high-throughput experimental assays (~3M mutated sequences), as well as clinical annotations from experts about the pathogenicity of mutants in over 3k human genes.

Usage examples

ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design by Pascal Notin, et al.
Scoring ProteinGym assays with TranceptEVE by Daniel Ritter
ProteinGym website by Pascal Notin & Daniel Ritter

See 3 usage examples →

QIIME 2 Tutorial Data

bioinformatics, biology, ecosystems, environmental, genetic, genomic, health, life sciences, metagenomics, microbiome

QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

Usage examples

See 3 usage examples →

SPaRCNet data: Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

The IIIC dataset includes 50,697 labeled EEG samples from 2,711 patients and 6,095 EEGs that were annotated by physician experts from 18 institutions. These samples were used to train SPaRCNet (Seizures, Periodic and Rhythmic Continuum patterns Deep Neural Network), a computer program that classifies IIIC events with an accuracy matching clinical experts.

Usage examples

SPaRCNet data: Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography by Jing, J., Ge, W., Struck, A. F., Fernandes, M., Hong, S., An, S., et al.
Development of Expert-Level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation by Jing J, Ge W, Hong S, Fernandes MB, Lin Z, Yang C, et al.
IIIC-SPaRCNet Github Repository by Brain Data Science Platform (BDSP)

See 3 usage examples →

STOIC2021 Training

computed tomography, computer vision, coronavirus, COVID-19, grand-challenge.org, imaging, life sciences, SARS-CoV-2

The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on grand-challenge.org.

Usage examples

STOIC2021 Challenge by Diagnostic Image Analysis Group, Radboudumc, Nijmegen
Study of Thoracic CT in COVID-19: The STOIC Project by Revel MP, Boussouar S, de Margerie-Mellon C, Saab I, Lapotre T, Mompoint D, et al.
How Well Do Self-Supervised Models Transfer to Medical Imaging? by Anton J, Castelli L, Chan MF, Outters M, Tang WH, Cheung V, et al.

See 3 usage examples →

The Human Microbiome Project

amino acid, fasta, fastq, genetic, genomic, life sciences, metagenomics, microbiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Usage examples

Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.
New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K. Crossman, Richard Stahl, Peter Eipers, et al.
The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight & Jeffrey I. Gordon

See 3 usage examples →

Transcriptomic MIT Licensed data and models

biodiversity, Biohub, biology, biomolecular modeling, cell biology, hdf5, life sciences, machine learning, model, protein, transcriptomics

This dataset contains transcriptomic biological data and models. The models embed transcriptomic data and facilitate transcriptomic analysis. The data is sourced and curated by a team of experts at Biohub and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

scGenePT Perturbation Prediction Tutorial by Biohub
scGenePT: Is language all you need for modeling single-cell perturbations? by Ana-Maria Istrate, Donghui Li, Theofanis Karaletsos
Quickstart Tutorial for Transcriptformer by Biohub
Documentation for Transcriptformer by Biohub
A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model by Pearce, J. D., et al.

See 6 usage examples →

UCSF Renal Mass CT Dataset

cancer, computed tomography, life sciences, medical imaging, medicine, radiology

This dataset provides a set of 831 3D multiphase CT exams of renal masses, registered across phases, with annotations identifying the masses.

Usage examples

See 3 usage examples →

Variant Effect Predictor (VEP) and the Loss-Of-Function Transcript Effect Estimator (LOFTEE) Plugin

genome wide association study, genomic, life sciences, loftee, vep

VEP determines the effect of genetic variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. The European Bioinformatics Institute produces the VEP tool/db and releases updates every 1 to 6 months. The latest release contains 267 genomes from 232 species containing 5,567,663 protein-coding genes. This dataset hosts the last 5 releases for human, rat, and zebrafish. It also hosts the reference files required by the Loss-Of-Function Transcript Effect Estimator (LOFTEE) plugin, which is commonly used with VEP.

Usage examples

See 3 usage examples →

run_dbcan CAZyme and CGC annotation database on AWS

benchmark, bioinformatics, life sciences, metagenomics, microbiome

Database for use with run_dbcan (CAZyme and CGC annotation), including CAZyme, Transporter, Transcription factor, Signaling Transduction Protein, Sulfatase, Peptidase, and Polysaccharide utilization Loci.

Usage examples

run_dbcan Documentation by Xinpeng Zhang; Haidong Yi; Yanbin Yin
run_dbcan by Xinpeng Zhang; Haidong Yi; Jinfang Zheng; Le Huang; Qiwei Ge; Yanbin Yin
dbCAN3: automated carbohydrate-active enzyme and substrate annotation by Jinfang Zheng, Qiwei Ge, Yuchen Yan, Xinpeng Zhang, Le Huang, Yanbin Yin

See 3 usage examples →

4D Nucleome (4DN)

bioinformatics, biology, genetic, genomic, imaging, life sciences

The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...

Usage examples

See 2 usage examples →

Allen Institute for Brain Science - Synaptic Physiology Public Data Set

electrophysiology, Homo sapiens, life sciences, Mus musculus, neurobiology, signal processing

This is a large-scale survey that describes the physiology (strength, kinetics, and short term plasticity) of thousands of synapses from patch clamp experiments in mouse visual cortex and human middle temporal gyrus.

Usage examples

Local connectivity and synaptic dynamics in mouse and human neocortex by Campagnola L., Seeman S., et al.
aisynphys python package for accessing synaptic physiology data by Campagnola L., Seeman S., et al.

See 2 usage examples →

Allen Institute for Neural Dynamics - Extracellular Electrophysiology Compression Benchmark

electrophysiology, life sciences, Mus musculus, neurobiology, signal processing

Extracellular electrophysiology data is growing at a remarkable pace. This data, collected with Neuropixels probes by the Allen Institute and the International Brain Lab, can be used to benchmark throughput rates and storage ratios of various data compression algorithms.

Usage examples

See 2 usage examples →

Allen Institute for Neural Dynamics - Extracellular Electrophysiology Hybrid Evaluation Benchmark

electrophysiology, life sciences, Mus musculus, neurobiology, signal processing

Evaluation of spike sorting methods is a challenging task, as it requires both ground-truth data and a variety of sorting algorithms to compare against. This dataset contains a set of hybrid data specifically designed for benchmarking spike sorting methods.

Usage examples

See 2 usage examples →

Allen Institute intratelencephalic neuron connectivity paper supplemental data

connectomics, electron microscopy, imaging, life sciences, neuroscience

Organized data files for plotting the figures in the manuscript on VISp intratelencephalic (IT) neuron connectivity, which uses the MICrONS EM dataset.

Usage examples

IT-circuit-Figures-clean by valering_z
Cell-type-specific parallel pathways in the canonical cortical microcircuit by Chi Zhang, Casey M Schneider-Mizell, Bethanny P Danskin, Rachael Swanstrom, Erika Neace, Emily Joyce, Benjamin D Pedigo, Forrest C Collman, and Nuno Maçarico da Costa

See 2 usage examples →

Animal Tracking - Acoustic Telemetry - Quality controlled detections

biology, life sciences, marine mammals, oceans

Since 2007, the Integrated Marine Observing System’s Animal Tracking Facility (formerly known as the Australian Animal Tracking And Monitoring System (AATAMS)) has established a permanent array of acoustic receivers around Australia to detect the movements of tagged marine animals in coastal waters. Simultaneously, the Animal Tracking Facility developed a centralised national database (https://animaltracking.aodn.org.au/) to encourage collaborative research across the Australian research community and provide unprecedented opportunities to monitor broad-scale animal movements. The resulting da...

Usage examples

See 2 usage examples →

Biodiversity Heritage Library Metadata and Page Images

biodiversity, bioinformatics, life sciences

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access.

Usage examples

See 5 usage examples →

Biological and Physical Sciences (BPS) Microscopy Benchmark Training Dataset

fluorescence imaging, GeneLab, genetic, genetic maps, life sciences, microscopy, NASA SMD AI

Fluorescence microscopy images of individual nuclei from mouse fibroblast cells irradiated with Fe particles or X-rays, with fluorescent foci indicating 53BP1 positivity, a marker of DNA damage. These are maximum intensity projections of 9-layer microscopy Z-stacks.
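
As a rough illustration of what a maximum intensity projection does, the sketch below collapses a 9-layer Z-stack into a single 2D image by keeping the brightest value at each pixel. It assumes NumPy and uses synthetic data; the array shape, the planted bright spot, and the function name are hypothetical and are not taken from the dataset or its tooling.

```python
import numpy as np


def max_intensity_projection(z_stack: np.ndarray) -> np.ndarray:
    """Collapse a (layers, height, width) stack into one (height, width) image.

    For every pixel, keep the largest value found across all layers.
    """
    if z_stack.ndim != 3:
        raise ValueError("expected a 3D array shaped (layers, height, width)")
    return z_stack.max(axis=0)


# Synthetic 9-layer stack standing in for real microscopy data (assumption).
rng = np.random.default_rng(0)
stack = rng.integers(0, 255, size=(9, 4, 4), dtype=np.uint8)

# Plant one bright "focus" in layer 5 at pixel (row 2, column 1).
stack[5, 2, 1] = 255

projection = max_intensity_projection(stack)
```

Because the projection keeps only the per-pixel maximum, a focus that is bright in any single layer stays visible in the 2D image, while depth information is discarded.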

Usage examples

NASA SMD AI Workshop Report by SMD Artificial Intelligence (AI) Initiative
Dose, LET and Strain Dependence of Radiation-Induced 53BP1 Foci in 15 Mouse Strains Ex Vivo Introducing Novel DNA Damage Metrics by Sébastien Penninckx, Egle Cekanaviciute, Charlotte Degorre, Elodie Guiet, Louise Viger, Stéphane Lucas, Sylvain V. Costes

See 2 usage examples →

Biological and Physical Sciences (BPS) RNA Sequencing Benchmark Training Dataset

gene expression, GeneLab, genetic, genetic maps, life sciences, NASA SMD AI, space biology

RNA sequencing data from spaceflown and control mouse liver samples, sourced from NASA GeneLab and augmented with a generative adversarial network.

Usage examples

NASA SMD AI Workshop Report by SMD Artificial Intelligence (AI) Initiative
Adversarial generation of gene expression data by Ramon Viñas, Helena Andrés-Terré, Pietro Liò, Kevin Bryson

See 2 usage examples →

Brain Encoding Response Generator (BERG)

brain models, computer vision, deep learning, life sciences, machine learning, neuroimaging, neuroscience

Brain Encoding Response Generator (BERG) is a resource consisting of multiple pre-trained encoding models of the brain and an accompanying Python package to generate accurate in silico neural responses to arbitrary stimuli with just a few lines of code.

Usage examples

Quickstart Tutorial by Domenic Bersch
In-Silico fMRI Data Tutorial by Alessandro Gifford
Brain Encoding Response Generator (BERG) by Alessandro Gifford
In-Silico EEG Data Tutorial by Alessandro Gifford
The Brain Encoding Response Generator by Alessandro Gifford

See 5 usage examples →

Brain/MINDS Marmoset Connectivity Resource on AWS

brain images, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience, nifti, non-human primate

Brain/MINDS Marmoset Connectivity Resource (BMCR) is a resource that provides access to anterograde and retrograde neuronal tracer data, made available by the Brain/MINDS project. It is currently restricted to injections into the prefrontal cortex of a marmoset brain but is planned to include injections into entire cortical areas and representative subcortical brain regions.

Usage examples

Marmoset PFC connectome by Akiya Watakabe, Henrik Skibbe and Tetsuo Yamamori
Explorer Tutorials by Henrik Skibbe
The Brain/MINDS Marmoset Connectivity Resource - An open-access platform for cellular-level tracing and tractography in the primate brain by H. Skibbe, M.F. Rachmadi, K. Nakae, C. E. Gutierrez, J. Hata, H. Tsukada, C. Poon, K. Doya, P. Majka, M. G. P. Rosa, M. Schlachter, H. Okano, T. Yamamori, S. Ishii, M. Reisert, A. Watakabe.
BMCR website by Henrik Skibbe
Local and long-distance organization of prefrontal cortex circuits in the marmoset brain. by Watakabe A, Skibbe H, Nakae K, Abe H, Ichinohe N, Rachmadi MF, Wang J, Takaji M, Mizukami H, Woodward A, Gong R, Hata J, Van Essen DC, Okano H, Ishii S, Yamamori T.

See 5 usage examples →

BrainGlobe Atlases

biology, digital preservation, Homo sapiens, image processing, imaging, life sciences, light-sheet microscopy, magnetic resonance imaging, medical imaging, microscopy, Mus musculus, neurobiology, neuroimaging, neuroscience, Rattus norvegicus, volumetric imaging, zarr

BrainGlobe provides an archive and standardised interface to anatomical atlases from multiple species. This dataset includes these atlases, and other data (e.g. sample neuroanatomy data) to allow the greatest use of the atlases.

Usage examples

See 2 usage examples →

BrainSeq - Neurogenomics to Drive Novel Target Discovery for Neuropsychiatric Disorders

gene expression, genotyping, life sciences, transcriptomics

This ambitious project seeks to characterize the genetic and epigenetic regulation of multiple facets of transcription in distinct brain regions across the human lifespan in samples of major neuropsychiatric disorders and controls. Initially focused on schizophrenia and mood disorders, the goal of this consortium is to elucidate the underlying molecular mechanisms of genetic associations with the goal of identifying novel therapeutic targets. The consortium currently consists of seven pharmaceutical companies and a not-for-profit medical research institution working as a precompetitive team to...

Usage examples

See 2 usage examples →

Broad Genome References

bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index

Broad-maintained human genome reference builds hg19/hg38 and decoy references.

Usage examples

Using Amazon FSx for Lustre for Genomics Workflows on AWS by W. Lee Pang
Advancing NGS quality control to enable measurement of actionable mutations in circulating tumor DNA by Willey J. C., Morrison T. B., Austermiller B., Crawford E. E., et al (2021)

See 2 usage examples →

COVID-19 Data Lake

amazon.science, bioinformatics, biology, coronavirus, COVID-19, health, life sciences, medicine, MERS, SARS

A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and rela...

Usage examples

See 5 usage examples →

Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer

cancer, genomic, life sciences, STRIDES, transcriptomics

The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Usage examples

Genome-wide discovery of somatic coding and noncoding mutations in pediatric endemic and sporadic Burkitt lymphoma by Grande B. M., Gerhard D. S., Jiang A., Griner N. B., Abramson J. S., Alexander T. B., et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

Cell Painting Image Collection

biology, cell imaging, cell painting, fluorescence imaging, high-throughput imaging, imaging, life sciences, microscopy

The Cell Painting Image Collection is a collection of freely downloadable microscopy image sets. Cell Painting is an unbiased high throughput imaging assay used to analyze perturbations in cell models. In addition to the images themselves, each set includes a description of the biological application and some type of "ground truth" (expected results). Researchers are encouraged to use these image sets as reference points when developing, testing, and publishing new image analysis algorithms for the life sciences. We hope that this data set will lead to a better understanding of w...

Usage examples

See 2 usage examples →

Cloud Indexes for Bowtie, Kraken, HISAT, and Centrifuge

bioinformatics, biology, genomic, life sciences, mapping, medicine, reference index, whole genome sequencing

Genomic tools use reference databases as indexes to operate quickly and efficiently, analogous to how web search engines use indexes for fast querying. Here, we aggregate genomic, pan-genomic and metagenomic indexes for analysis of sequencing data.

Usage examples

Table of contents for tutorials for constituent tools by Ben Langmead
Reducing reference bias using multiple population reference genomes by Chen et al (2020)

See 2 usage examples →

Cryo-EM SPA Workflow Records

life sciences, machine learning, structural biology

The “Cryo-EM SPA Workflow Records” contains all outputs of all processing steps involved in cryogenic electron microscopy (cryo-EM) single particle analysis (SPA), including both intermediate and final output data. The primary focus will be on data generated by RELION and CryoSPARC, two widely used software packages for cryo-EM SPA. These records will be archived systematically. To ensure the data remains reproducible while minimizing storage demands, large-sized files that can be regenerated will be excluded prior to registration. The aim is to retain only the essential metadata, processing ...

Usage examples

See 2 usage examples →

DNAStack COVID19 SRA Data

bambioinformaticscoronavirusCOVID-19fastafastqgeneticgenomicglobalhealthlife scienceslong read sequencingSARS-CoV-2vcfviruswhole genome sequencing

The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...

Usage examples

Viral lineage assignment by Heather Ward
Viral AI by DNAstack

See 2 usage examples →

Dendritic Consortium Multimodal Dataset

brain imagesbrain modelselectron microscopyelectrophysiologyimaginglife sciencesMus musculusneurobiologyneuroimagingneurophysiologyneurosciencesimulation neurosciencesingle neuron models

The Dendritic Consortium provides a multimodal dataset integrating calcium and voltage imaging, electrophysiology, electron microscopy, proteomics, and computational models of Baz1a pyramidal neurons in the mouse primary visual cortex (V1).

Usage examples

See 2 usage examples →

E11bio PRISM

bioinformaticsbiologybrain imagescell imagingcomputer visionfluorescence imaginghigh-throughput imagingimage processingimagingion channelslife sciencesmachine learningmicroscopymorphological reconstructionsMus musculusneurobiologyneuroimagingneuroscienceproteinsegmentationzarr

This dataset was generated using E11.bio's PRISM technology (Protein Reconstruction and Identification through Multiplexing), a platform that combines viral barcoding, expansion microscopy, and iterative immunolabeling for large-scale neuronal reconstruction. Neurons in the mouse hippocampal CA3 were transduced with a library of adeno-associated viruses (AAVs) encoding diverse “protein bits”: small epitope tags that act as combinatorial barcodes. Tissue was then processed with an expansion microscopy protocol, physically enlarging the sample ~5× to achieve an effective voxel size of ~35 × 3...

Usage examples

See 2 usage examples →

EMBER Open Datasets

activity detectionactivity recognitionanalyticsbioinformaticsbrain imagesbrain modelscloud computingcomputer visiondeep learningelectrophysiologyGPSh5hdf5Homo sapiensjsonlife scienceslocalizationmachine learningmagnetic resonance imagingMus musculusneurobiologyneuroimagingneurophysiologyneurosciencenon-human primatesignal processingspeech processingzarr

This is data from the Ecosystem for Multi-modal Brain-behavior Experimentation and Research (EMBER). It contains time series behavioral and neuroscience data from animal and deidentified human subjects across multiple modalities.

Usage examples

Mapping the landscape of social behavior by Ugne Klibaite, Tianqing Li, Diego Aldarondo, Jumana F Akoad, Bence P Ölveczky, Timothy W Dunn.
Get To Know A Dataset - EMBER by EMBER Team

See 2 usage examples →

EMory BrEast Imaging Dataset (EMBED)

biasbiologycancerhealthimaginglife sciencesmammographyx-ray

EMBED is a racially diverse mammography dataset containing 3.4M screening and diagnostic images from 110,000 patients collected from 2013 to 2020, with equal representation of Black and White women. The dataset comprises 2D, synthetic 2D (C-view), and 3D (digital breast tomosynthesis, i.e. DBT) images. It contains 60,000 annotated lesions linked to structured imaging descriptors and ground truth pathologic outcomes grouped into six severity classes. This release represents 20% of the total 2D and C-view dataset and is available for research use. DBT, US, and MRI exams will be added at a ...

Usage examples

See 2 usage examples →

Emory Knee Radiograph (MRKR) dataset

bioinformaticsbiologycomputer visioncsvhealthimaginglabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray

The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of whom are African American. This dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type, and presence of hardware, enhancing its value for research and model development. MRKR addresses significant gaps ...

Usage examples

Emory Knee Radiograph Dataset by Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, Hari Trivedi.
Example Notebook by Emory-HITI

See 2 usage examples →

FLAb: Fitness Landscapes for Antibodies

life sciencesmachine learningproteinprotein template

FLAb is the largest publicly available therapeutic antibody dataset designed to train and benchmark protein AI models. It provides open-access, high-quality developability data on diverse therapeutic properties, including expression, thermostability, immunogenicity, aggregation, polyreactivity, binding affinity, and pharmacokinetics.

Usage examples

See 2 usage examples →

GATK Structural Variation (SV) Data

bioinformaticsbiologycromwellgatk-svgeneticgenomiclife sciencesstructural variation

This dataset holds the data needed to run a structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data in AWS.

Usage examples

Structural Variant Analysis on AWS with Amazon FSx for Lustre by Goldfinch Bio and Loka Inc.
AWS Setup & Execution by Goldfinch Bio and Loka Inc.

See 2 usage examples →

Genomic Characterization of Metastatic Castration Resistant Prostate Cancer

cancergenomiclife sciencesSTRIDESwhole genome sequencing

Biopsies of castration resistant prostate cancer metastases were subjected to whole genome sequencing (WGS), along with RNA-sequencing (RNA-Seq). The overarching goal of the study is to illuminate molecular mechanisms of acquired resistance to therapeutic agents, and particularly androgen signaling inhibitors, in the treatment of metastatic castration resistant prostate cancer (mCRPC). This study is made available on AWS via the NIH STRIDES Initiative.

Usage examples

Genomic characterization of metastatic castration-resistant prostate cancer patients undergoing PSMA radioligand therapy: A single-center experience by Swayamjeet Satapathy, Chandan K Das, et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

Genoxus Annotation

geneticgenomiclife sciencesvariant annotationwhole exome sequencingwhole genome sequencing

Genoxus Annotation is a harmonized and curated collection of human genetic variant databases designed to support accurate and scalable variant annotation. Variant annotation following genetic testing such as whole genome sequencing (WGS) or whole exome sequencing (WES) is a critical step in identifying and interpreting disease-associated genetic factors. As sequencing technologies continue to generate large volumes of genomic data, robust and well-structured annotation resources are essential for translating raw variant calls into clinically meaningful insights. Genoxus Annotation v1.0 integrat...

Usage examples

See 2 usage examples →

Harvard Electroencephalography Database

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The Harvard EEG Database will encompass data gathered from four hospitals affiliated with Harvard University: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH).

Usage examples

Harvard Electroencephalography Database by Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., et al.
Harvard-EEG-Database-Tools by Brain Data Science Platform (BDSP)

See 2 usage examples →

Harvard-Emory ECG Database

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.

Usage examples

Harvard Electroencephalography Database by Moura Junior, V.; Reyna, M.; Hong, S.; Gupta, A.; Ghanta, M.; Sameni, R., et al.
WFDB Software Package by Moody, G., Pollard, T., & Moody, B.

See 2 usage examples →

Hecatomb Databases

bioinformaticsgeneticgenomiclife sciencesmetagenomicsviruswhole genome sequencing

Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.

Usage examples

See 2 usage examples →

Human Cell Atlas

biologycell biologycell imaginggene expressiongenomegenomicHomo sapienslife sciencesMus musculussingle-cell transcriptomicstranscriptomics

The Human Cell Atlas (HCA) is a collaborative community of international scientists. Our mission is to create comprehensive reference maps of all the cells in the human body as a basis for both understanding human health and diagnosing, monitoring, and treating disease. The HCA registry has more than one thousand member scientists from hundreds of institutions around the world. The project is steered and governed by an Organizing Committee, co-chaired by Aviv Regev and Sarah Teichmann.

Usage examples

The Human Cell Atlas White Paper by Aviv Regev, Sarah Teichmann, Orit Rozenblatt-Rosen, Michael Stubbington, Kristin Ardlie, Ido Amit, Paola Arlotta, Gary Bader, Christophe Benoist, Moshe Biton, Bernd Bodenmiller, Benoit Bruneau, Peter Campbell, Mary Carmichael, Piero Carninci, Leslie Castelo-Soccio, Menna Clatworthy, Hans Clevers, Christian Conrad, Roland Eils, Jeremy Freeman, Lars Fugger, Berthold Goettgens, Daniel Graham, Anna Greka, Nir Hacohen, Muzlifah Haniffa, Ingo Helbig, Robert Heuckeroth, Sekar Kathiresan, Seung Kim, Allon Klein, Bartha Knoppers, Arnold Kriegstein, Eric Lander, Jane Lee, Ed Lein, Sten Linnarsson, Evan Macosko, Sonya MacParland, Robert Majovski, Partha Majumder, John Marioni, Ian McGilvray, Miriam Merad, Musa Mhlanga, Shalin Naik, Martijn Nawijn, Garry Nolan, Benedict Paten, Dana Pe'er, Anthony Philippakis, Chris Ponting, Steve Quake, Jayaraj Rajagopal, Nikolaus Rajewsky, Wolf Reik, Jennifer Rood, Kourosh Saeb-Parsy, Herbert Schiller, Steve Scott, Alex Shalek, Ehud Shapiro, Jay Shin, Kenneth Skeldon, Michael Stratton, Jenna Streicher, Henk Stunnenberg, Kai Tan, Deanne Taylor, Adrian Thorogood, Ludovic Vallier, Alexander van Oudenaarden, Fiona Watt, Wilko Weicher, Jonathan Weissman, Andrew Wells, Barbara Wold, Ramnik Xavier, Xiaowei Zhuang, Human Cell Atlas Organizing Committee
The Human Cell Atlas: towards a first draft atlas by Various authors
The Human Cell Atlas from a cell census to a unified foundation model by Jennifer E. Rood, Samantha Wynne, Lucia Robson, Anna Hupalowska, John Randell, Sarah A. Teichmann & Aviv Regev
The Human Cell Atlas: towards a first draft atlas by Various authors
The network effect: studying COVID-19 pathology with the Human Cell Atlas by Sarah Teichmann, Aviv Regev

See 5 usage examples →

Indexes for Kaiju

bioinformaticsbiologygenomiclife sciencesmetagenomicsmicrobiomereference indexwhole genome sequencing

This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.

Usage examples

Fast and sensitive taxonomic classification for metagenomics with Kaiju by Peter Menzel et al (2016)
Quickstart Tutorial for downloading the index files and running Kaiju by Peter Menzel

See 2 usage examples →

Integrative Analysis of Lung Adenocarcinoma in Environment and Genetics Lung cancer Etiology (Phase 2)

cancerepigenomicsgenomiclife sciencesSTRIDESwhole exome sequencingwhole genome sequencing

We performed whole genome sequencing and whole exome sequencing of 31 lung adenocarcinoma (LUAD) samples from the Environment And Genetics in Lung cancer Etiology (EAGLE) study. The EAGLE study is made available on AWS via the NIH STRIDES Initiative (https://aws.amazon.com/blogs/publicsector/aws-and-national-institutes-of-health-collaborate-to-accelerate-discoveries-with-strides-initiative/).

Usage examples

See 2 usage examples →

LEarning biOchemical Prostate cAncer Recurrence from histopathology sliDes challenge (LEOPARD) Dataset

cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences

This dataset contains all the data for the LEarning biOchemical Prostate cAncer Recurrence from histopathology sliDes challenge or LEOPARD. Prostate cancer, impacting 1.4 million men annually, is a prevalent malignancy (H. Sung et al., 2021). A substantial number of these individuals undergo prostatectomy as the primary curative treatment. The efficacy of this surgery is assessed, in part, by monitoring the concentration of prostate-specific antigen (PSA) in the bloodstream. While the role of PSA in prostate cancer screening is debatable (W. F. Clark et al., 2018; E. A. M. Heijnsdijk et al., 2018), it serves as a valuable biomarker for postprostatectomy follow-up in patients. Following successful surgery, PSA concentration is typically undetectable (<0.1 ng/mL) within 4-6 weeks (S. S. Goonewardene et al., 2014). However, approximately 30% of patie...

Usage examples

See 2 usage examples →

Multi-Anatomy Post-Surgical Magnetic Resonance Dataset (MAPSMR)

life sciencesmachine learningmagnetic resonance imagingmedical imaging

The MAPSMR dataset is a multi-organ, post-surgical MRI benchmark dataset focused on organ absence and altered anatomy after common abdominal and pelvic surgeries. The dataset includes cases such as cholecystectomy, prostatectomy, nephrectomy, colectomy, hepatectomy, and related procedures, with annotations identifying surgically absent organs and post-treatment anatomical changes.

Usage examples

Get to know a dataset - GEHCAI-MAPSMR by https://github.com/fastestimator
Decipher-MR: a vision-language foundation model for 3D MRI representations by Zhijian Yang, Noel DSouza, Istvan Megyeri, Xiaojian Xu, Amin Honarmandi Shandiz, Farzin Haddadpour, Krisztian Koos, Laszlo Rusko, Emanuele Valeriano, Bharadwaj Swaminathan, Lei Wu, Parminder Bhatia, Taha Kass-Hout, Erhan Bas

See 2 usage examples →

NIH NLM NCBI PubMed Central (PMC) Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS

csvlife sciencesSTRIDEStxtxml

PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on PMCID and version number:

The PMC Open Access (OA) Subset, which includes all articles in PMC that are available for reuse based on terms specified by the publisher. The majority of avai...

Usage examples

Extracting insights from PubMed articles using Amazon Q Business by Bharath Gunapati and Stefan Mationg
Accessing PMC Article Datasets Using Amazon Web Services by NCBI PMC

See 2 usage examples →

National Cancer Institute Center for Cancer Research - Diffuse Large B Cell Lymphoma (DLBCL) Genomics and Expression

cancergenomiclife sciences

The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas (DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification data.

Usage examples

Genomic Data Commons by National Cancer Institute
Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., et al.

See 2 usage examples →

OpenCRAVAT

geneticgenomiclife sciencessqlitetertiary analysisvariant annotation

OpenCRAVAT is a modular variant annotation tool developed by KarchinLab at Johns Hopkins. This dataset is a mirror of the OpenCRAVAT store available at https://store.opencravat.org. You can configure OpenCRAVAT to use this mirror by editing the "cravat-system.yml" file. The path to this file is in the first output line of the command "oc config system". In that file, change the value of "store_url" to "https://opencravat-store-aws.s3.amazonaws.com".
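The edit described above can be sketched as a small text rewrite. This is an illustration only: it assumes "store_url" appears as a top-level key on its own line in "cravat-system.yml", and the function name, the sample file contents, and the example path are hypothetical. It operates on the file's text rather than on the file itself, so nothing is changed until you write the result back.

```python
import re

# Mirror URL taken from the dataset description above.
MIRROR_URL = "https://opencravat-store-aws.s3.amazonaws.com"


def set_store_url(config_text: str, new_url: str = MIRROR_URL) -> str:
    """Return config_text with the top-level `store_url` value replaced.

    Assumes `store_url:` sits at the start of a line; if the key is
    missing, it is appended at the end of the text.
    """
    pattern = re.compile(r"^store_url:.*$", flags=re.MULTILINE)
    new_line = f"store_url: {new_url}"
    if pattern.search(config_text):
        # A function replacement avoids any escape processing of the URL.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{new_line}\n"


# Hypothetical file contents; the real file will contain other settings.
sample = "example_setting: /hypothetical/path\nstore_url: https://store.opencravat.org\n"
updated = set_store_url(sample)
```

To apply it, read the file whose path is reported by "oc config system", pass its contents through `set_store_url`, and write the result back, keeping a backup copy of the original.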

Usage examples

Changing the OpenCRAVAT store url by Kyle Moad
OpenCRAVAT by Karchinlab

See 2 usage examples →

Oregon Health & Science University Chronic Neutrophilic Leukemia Dataset

cancergenomiclife sciences

The OHSU-CNL study offers whole exome and RNA sequencing data on a cohort of 100 cases with rare hematologic malignancies such as chronic neutrophilic leukemia (CNL), atypical chronic myeloid leukemia (aCML), and unclassified myelodysplastic syndrome/myeloproliferative neoplasms (MDS/MPN-U). This dataset contains open RNA-Seq Gene Expression Quantification data.

Usage examples

Genomic Data Commons by National Cancer Institute
Genomic landscape of neutrophilic leukemias of ambiguous diagnosis by Zhang H, Wilmot B, Bottomly D et al.

See 2 usage examples →

Pancreatic Cancer Organoid Profiling

cancergeneticgenomiclife sciencesSTRIDEStranscriptomicswhole genome sequencing

This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS) and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.

Usage examples

Genomic Data Commons by National Cancer Institute
Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer by Tiriac H, Belleau P, Engle DD, Plenker D, Deschênes A, Somerville TD, et al.

See 2 usage examples →

REDASA COVID-19 Open Data

coronavirusCOVID-19information retrievallife sciencesnatural language processingtext analysis

The REaltime DAta Synthesis and Analysis (REDASA) COVID-19 snapshot contains the output of the curation protocol produced by our curator community. A detailed description can be found in our paper. The first S3 bucket listed in Resources contains a large collection of medical documents in text format extracted from the CORD-19 dataset, plus other sources deemed relevant by the REDASA consortium. The second S3 bucket contains a series of documents surfaced by Amazon Kendra that were considered relevant for each medical question asked. The final S3 bucket contains the GroundTruth annotations cr...

Usage examples

Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study by Uddhav Vaghela, Simon Rabinowicz, Paris Bratsos, Guy Martin, Epameinondas Fritzilas, et al.
Curadr - Curation Platform by REDASA Consortium, Imperial College London

See 2 usage examples →

RNA structure by fragmentation frequency

bioinformaticsgenomiclife sciencestranscriptomics

The fragSTRUC project develops software to extract RNA secondary structure information from Illumina datasets, based on the observation that divalent ions used in standard RNA-seq library preparation fragment sequences at non-base-paired regions of RNA.

Usage examples

Accessing the fragSTRUC dataset on AWS by Yuk Kei Wan and Leonard Schärfen
fragSTRUC: RNA structure by fragmentation frequency by Yuk Kei Wan and Leonard Schärfen

See 2 usage examples →

Reference Indexes for krepp

bioinformaticslife sciencesmetagenomicsmicrobiomereference index

krepp is an alignment-free method for estimating distances and phylogenetic placement of individual reads to many thousands of reference genomes in a scalable manner using k-mers. This dataset includes k-mer-based indexes consisting of ultra-large reference genome sets that can be efficiently analyzed using krepp.

Usage examples

See 2 usage examples →

Reference data for HiFi human WGS

genetichealthHomo sapienslife scienceslong read sequencingmappingvariant annotationvcfwhole genome sequencing

Reference data bundle for analyzing HiFi human whole genome sequencing data

Usage examples

See 2 usage examples →

Somatic Mosaicism across Human Tissues (SMaHT)

bambioinformaticsbiologygeneticgenomicimaginglife scienceswhole genome sequencing

The Somatic Mosaicism across Human Tissues (SMaHT) project is an NIH Common Fund consortium (2023-) that aims to comprehensively characterize somatic variation ("mosaicism") in normal human tissues. While most genetic studies have relied on blood-derived DNA, SMaHT captures the full spectrum of DNA variation across cell types, tissues, and organs from phenotypically normal individuals to better understand the role of somatic mosaicism in human development, aging, and disease progression. Researchers in the consortium develop and apply experimental and computational methods, paired with th...

Usage examples

The Somatic Mosaicism across Human Tissues Network by Coorens T, Oh J, Choi Y, Lim N, Zhao B, Voshall A et al.
Somatic Mosaicism across Human Tissues Data Portal by SMaHT Data Analysis Center (DAC)

See 2 usage examples →

Sounds of Central African landscapes

biodiversitybiologyecosystemsgeospatiallandlife sciencesnatural resourcesurvey

Archival soundscapes recorded in the rainforest landscapes of Central Africa, with a focus on the vocalizations of African forest elephants (Loxodonta cyclotis).

Usage examples

You can now hear rainforest sounds worldwide-here's why that matters by Rachel Fobar
Listen to the rainforest chorus that's helping scientists protect African elephants by Amazon Staff

See 2 usage examples →

TIGER Training

cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences

This dataset contains the training data for the Tumor InfiltratinG lymphocytes in breast cancER or TIGER challenge. TIGER is the first challenge on fully automated assessment of tumor-infiltrating lymphocytes (TILs) in breast cancer histopathology slides. TILs are proving to be an important biomarker in cancer patients as they can play a part in killing tumor cells, particularly in some types of breast cancer. Identifying and measuring TILs can help to better target treatments, particularly immunotherapy, and may result in lower levels of other more aggressive treatments, including chemo...

Usage examples

See 2 usage examples →

Tabula Sapiens

Biohubbiologyencyclopedicgeneticgenomichealthlife sciencesmedicinesingle-cell transcriptomics

Tabula Sapiens is a benchmark, first-draft human cell atlas of over 1.1M cells from 28 organs of 24 normal human subjects. This work is the product of the Tabula Sapiens Consortium. Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects, and allows detailed analysis and comparison of cell types that are shared between tissues. Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments. We...

Usage examples

The Tabula Sapiens: a multiple organ single cell transcriptomic atlas of humans by The Tabula Sapiens Consortium
Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues by The Tabula Sapiens Consortium, Stephen R Quake

See 2 usage examples →

UniProt

bioinformaticsbiologychemistryenzymegraphlife sciencesmoleculeproteinRDFSPARQL

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.

Usage examples

Exploring the UniProt protein knowledgebase with AWS Open Data and Amazon Neptune by Eric Greene, Rafa Xu, Yuan Shi (AWS)
UniProt SPARQL by Swiss-Prot Group at SIB Swiss Institute of Bioinformatics

See 2 usage examples →

Aging Mouse Brain Epigenetic

bamcramfastqgeneticgenomiclife sciencestranscriptomicswhole exome sequencingwhole genome sequencing

Aging is a major risk factor for neurodegenerative diseases, yet underlying epigenetic mechanisms remain unclear. Here, we generated a comprehensive single-nucleus cell atlas of brain aging across multiple brain regions, comprising 132,551 single-cell methylomes and 72,666 joint chromatin conformation-methylome nuclei. Integration with companion transcriptomic and chromatin accessibility data yielded a cross-modality taxonomy of 36 major cell types.

Usage examples

Cell-type-specific transposable element demethylation and TAD remodeling in the aging mouse brain by Zeng, Q., Wei, T., Klein, A., Bartlett, A., Liu, H., Nery, J.R., Castanon, R., Osteen, J., Johnson, N.D., Wang, W., Ding, W., Chen, H., Altshul, J., Kenworthy, M., Valadon, C., Owens, W., Wu, Z., Amaral, M.L., Song, Báez-Becerra, Tatiana, Cho, S., Chen, C., Willier, J., Cao, S., Rink, J., Lee, J., Barcoma, A., Arzavala, J., Emerson, N., Lu, Y.R., Ren, B., Behrens, Margarita, Ecker, J.R.

See 1 usage example →

Allen Brain Observatory - Visual Coding AWS Public Data Set

electrophysiologyimage processingimaginglife sciencesMus musculusneurobiologyneuroimagingsignal processing

The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, c...

Usage examples

Use the Allen Brain Observatory – Visual Coding on AWS by Nika Keller, David Feng

See 1 usage example →

Allen Institute for Neural Dynamics - Mouse Neuroanatomy and Physiology Data

electrophysiologyimage processingimaginglife sciencesMus musculusneurobiologyneuroimagingsignal processing

The Allen Institute for Neural Dynamics (AIND) is committed to FAIR, Open, and Reproducible science. We therefore share all of the raw and derived data we collect publicly with rich metadata, including preliminary data collected during methods development, as near to the time of collection as possible.

Usage examples

AIND Open Data Access by David Feng, Saskia de Vries

See 1 usage example →

CHIMERA

cancercomputational pathologycomputer visiondeep learningdigital pathologygrand-challenge.orghistopathologylife sciencesmachine learningmedical image computingmedical imaging

This dataset contains the training data for the CHIMERA - Combining HIstology, Medical imaging (radiology) and molEcular data for medical pRognosis and diAgnosis challenge. The CHIMERA Challenge aims to advance precision medicine in cancer care by addressing the critical need for multimodal data integration. Despite significant progress in AI, integrating transcriptomics, pathology, and radiology across clinical departments remains a complex challenge. Clinicians are faced with large, heterogeneous datasets that are difficult to analyze effectively. AI has the potential to unify multimodal dat...

Usage examples

CHIMERA Challenge by Computational Pathology Group Radboudumc, Nijmegen

See 1 usage example →

CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

amazon.sciencebioinformaticshealthlife sciencesnatural language processingus

DE-SynPUF is provided here as 1,000-person (1k), 100,000-person (100k), and 2,300,000-person (2.3m) data sets in the OMOP Common Data Model format. The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information. The purposes of the DE-SynPUF are to:

allow data entrepreneurs to develop and create software and applications that may eventually be applied to actual CMS claims data;
train researchers on the use and complexity of conducting anal

...

Usage examples

Predict patient health outcomes using OHDSI and machine learning on AWS by James Wiggins
Create data science environments on AWS for health analysis using OHDSI by James Wiggins
OHDSIonAWS by James Wiggins
Map clinical notes to the OMOP Common Data Model and healthcare ontologies using Amazon Comprehend Medical by James Wiggins

See 4 usage examples →

COVID-19 Genome Sequence Dataset

bambioinformaticsbiologycoronavirusCOVID-19cramfastqgeneticgenomichealthlife sciencesMERSSARSSTRIDEStranscriptomicsviruswhole genome sequencing

This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six ho...

Usage examples

Download SRA sequence data using Amazon Web Services (AWS) by NCBI SRA

See 1 usage example →

COVID-19 Open Research Dataset (CORD-19)

coronavirusCOVID-19life sciencesMERSSARS

Full-text and metadata dataset of COVID-19 and coronavirus-related research articles optimized for machine readability.

Usage examples

COVID-19 Open Research Dataset Challenge (CORD-19) by Kaggle

See 1 usage example →

Conformational Space of Short Peptides

amino acidbioinformaticsbiomolecular modelinglife sciencesmolecular dynamicsproteinstructural biology

Co-managed by Toyoko and the Structural Biology Group at the Universidad Nacional de Quilmes, this dataset allows us to explore the conformational space of all possible peptides using the 20 common amino acids. It consists of a collection of exhaustive molecular dynamics simulations of tripeptides and pentapeptides.
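To give a sense of scale for "all possible peptides using the 20 common amino acids", the sketch below counts and enumerates sequences of a given length. It is plain combinatorics, not the dataset's own tooling; the function names are made up, and it makes no claim about how the dataset's files are named or which sequences they actually cover.

```python
from itertools import islice, product

# One-letter codes of the 20 common amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_peptides(length: int) -> int:
    """Number of distinct sequences of the given length over 20 residues."""
    return len(AMINO_ACIDS) ** length


def iter_peptides(length: int):
    """Lazily yield every sequence of the given length, in alphabet order."""
    for residues in product(AMINO_ACIDS, repeat=length):
        yield "".join(residues)


tripeptide_count = count_peptides(3)    # 20**3
pentapeptide_count = count_peptides(5)  # 20**5
first_three = list(islice(iter_peptides(3), 3))
```

The generator is lazy on purpose: the pentapeptide space has millions of members, so iterating is cheaper than building the full list in memory.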

Usage examples

Intro to Conformational Space of Short Peptides by Sebastian Bassi and Virginia Gonzalez

See 1 usage example →

Global Biodiversity Information Facility (GBIF) Species Occurrences

biodiversitybioinformaticsconservationearth observationlife sciences

The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world's governments providing global data that document the occurrence of species. GBIF currently integrates datasets documenting over 1.6 billion species occurrences, growing daily. The GBIF occurrence dataset combines data from a wide array of sources including specimen-related data from natural history museums, observations from citizen science networks and environment recording schemes. While these data are constantly changing at GBIF.org, periodic snapshots are taken a...

Usage examples

GBIF and Apache-Spark on AWS tutorial by John Waller

See 1 usage example →

Human Cancer Models Initiative (HCMI) Cancer Model Development Center

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...

Usage examples

Genomic Data Commons by National Cancer Institute

See 1 usage example →

Human PanGenomics Project

cram, fast5, fastq, genetic, genomic, life sciences

This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.

Usage examples

Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes by Shafin et al (2020)

See 1 usage example →

Knowledge Portal Network Bottom-line Genetic Associations

genetic, genome wide association study, life sciences

At the Knowledge Portal Network, we aggregate and analyze genetic association results for a wide range of diseases and traits. For any given disease, a large number of individual genetic association datasets may have been generated. To make these results more interpretable, we meta-analyze all datasets for each phenotype, using a method that we term "bottom-line integrative analysis". Here we provide the bottom-line summary statistic files for public download.

Usage examples

Tutorial: Use cases for the Knowledge Portal Network bottom-line genetic associations by Jason Flannick
Leveraging type 1 diabetes human genetic and genomic data in the T1D knowledge portal by Kudtarkar P, Costanzo MC, Sun Y, Jang D, Koesterer R, Mychaleckyj JC, et al.
Cardiovascular Disease Knowledge Portal: A Community Resource for Cardiovascular Disease Research by Costanzo MC, Roselli C, Brandes M, Duby M, Hoang Q, Jang D, et al.
The Type 2 Diabetes Knowledge Portal: An open access genetic resource dedicated to type 2 diabetes and related traits by Costanzo MC, von Grotthuss M, Massung J, Jang D, Caulkins L, Koesterer R, et al.

See 4 usage examples →

LongBench - cross-platform reference dataset profiling cancer cell lines with bulk and single-cell approaches

bam, benchmark, bioinformatics, cancer, fastq, life sciences, long read sequencing, short read sequencing, single-cell transcriptomics, vcf

LongBench is a comprehensive benchmark dataset of the latest long-read transcriptomics technologies from Oxford Nanopore (ON) and Pacific Biosciences, alongside a comparison with next-generation sequencing from Illumina. We generated bulk and single-cell libraries from lung cancer cell lines which include different cancer subtypes to capture real biological variation. To further compare and assess sequencing platform performance, Sequins and SIRVs (Set 4) synthetic spike-ins have been included.

Usage examples

Benchmarking long-read DE gene and transcript analysis with edgeR by Yupei You

See 1 usage example →

NCBI SRA Gene Feature RNA-Seq counts

genetic, life sciences, RNA-seq, transcriptomics

The NIH Sequence Read Archive (SRA), hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), stores sequencing data and alignment information from high-throughput next-generation sequencing platforms. SRA has conducted gene expression analysis of publicly released human and mouse RNA-Seq experiments to process raw RNA-seq reads into concise formats that summarize the expression results. The un-normalized feature counts for each SRA record are available in tab-delimited (*.tsv) format. The tsv files include two columns, the gene id and count...
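As a rough illustration of the two-column layout described above, the following Python sketch reads a tab-delimited file into a dictionary keyed by gene id. It assumes that the second column holds an integer count and that a header row may or may not be present; neither is confirmed by the description. The function name `read_feature_counts` and the sample rows are made up for illustration.

```python
import csv
import io


def read_feature_counts(handle):
    """Parse a two-column, tab-delimited file into {gene_id: count}.

    Assumption: column 1 is a gene id and column 2 is an integer count.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene_id, raw = row[0], row[1]
        try:
            counts[gene_id] = int(raw)
        except ValueError:
            continue  # skip a header row or a non-integer value
    return counts


# Made-up rows for illustration only; not taken from a real SRA record.
sample = io.StringIO("GENE_A\t12\nGENE_B\t0\nGENE_C\t345\n")
counts = read_feature_counts(sample)
total = sum(counts.values())
```

Because the counts are described as un-normalized, a total such as `total` would typically be the starting point for any per-sample scaling done downstream.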

Usage examples

The Sequence Read Archive by Leinonen et al (2011)

See 1 usage example →

NYUMets Brain Dataset

biology, cancer, computer vision, health, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, medical imaging, medicine, neurobiology, neuroimaging, segmentation

This dataset contains 8,000+ brain MRIs of 2,000+ patients with brain metastases.

Usage examples

Longitudinal deep neural networks for assessing metastatic brain cancer on a massive open benchmark. by Link et al (2023)

See 1 usage example →

National Herbarium of Israel

biodiversity, biology, climate, digital preservation, environmental, image processing, imaging, life sciences

Our collection encompasses approximately one million vascular plant specimens from the Mediterranean and Middle East biodiversity hotspot, representing flora from Israel, Jordan, Hermon, Sinai, Egypt, the Caucasus, Arabia, North Africa, and throughout the Mediterranean basin. This scientifically significant repository includes published voucher specimens, original specimens used for "Flora Palaestina" illustrations, and critical references for the Israeli gene bank collections. The ongoing digitization process captures high-resolution images of each specimen while systematically inco...

Usage examples

How to use AWS S3 bucket to explore our public images dataset by Eyal Ben-Hur

See 1 usage example →

Ohio State Cardiac MRI Raw Data (OCMR)

Homo sapiens, image processing, imaging, life sciences, magnetic resonance imaging, signal processing

OCMR is an open-access repository that provides multi-coil k-space data for cardiac cine. The fully sampled MRI datasets are intended for quantitative comparison and evaluation of image reconstruction methods. The free-breathing, prospectively undersampled datasets are intended to evaluate their performance and generalizability qualitatively.

Usage examples

OCMR Tutorial by Chong Chen

See 1 usage example →

OpenNeuro

biology, imaging, life sciences, neurobiology, neuroimaging, neuroscience

OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by th...

Usage examples

Accessing UCLA Consortium for Neuropsychiatric Phenomics from OpenNeuro with Scigantic by Scigantic

See 1 usage example →

Orcasound - bioacoustic data for marine conservation

biodiversity, biology, coastal, conservation, deep learning, ecosystems, environmental, geospatial, labeled, life sciences, machine learning, mapping, oceans, open source software, signal processing

Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.

Usage examples

Github for our open source projects by Orcasound open source community

See 1 usage example →

Oxford Nanopore Technologies Benchmark Datasets

bioinformatics, biology, fast5, fastq, genomic, Homo sapiens, life sciences, whole genome sequencing

The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data, 2) assessment and reproduction of performance benchmarks, and 3) development of tools and methods. The data deposited showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly-available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.

Usage examples

ONT Dataset Tutorials by EPI2MELabs

See 1 usage example →

RSNA Abdominal Trauma Detection (RSNA-ABT)

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

Blunt force abdominal trauma is among the most common types of traumatic injury, with the most frequent cause being motor vehicle accidents. Abdominal trauma may result in damage and internal bleeding of the internal organs, including the liver, spleen, kidneys, and bowel. Detection and classification of injuries are key to effective treatment and favorable outcomes. A large proportion of patients with abdominal trauma require urgent surgery. Abdominal trauma often cannot be diagnosed clinically by physical exam, patient symptoms, or laboratory tests. Prompt diagnosis of abdominal trauma using...

Usage examples

The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset by Rudie, Jeffrey D.

See 1 usage example →

RSNA Abdominal Traumatic Injury CT (RATIC)

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

Blunt force abdominal trauma is among the most common types of traumatic injury, with the most frequent cause being motor vehicle accidents. Abdominal trauma may result in damage and internal bleeding of the internal organs, including the liver, spleen, kidneys, and bowel. Detection and classification of injuries are key to effective treatment and favorable outcomes. A large proportion of patients with abdominal trauma require urgent surgery. Abdominal trauma often cannot be diagnosed clinically by physical exam, patient symptoms, or laboratory tests. Prompt diagnosis of abdominal trauma using...

Usage examples

The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset by Rudie, Jeffrey D.

See 1 usage example →

RSNA Cervical Spine Fracture Detection (RSNA-CSF) Dataset

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

Over 1.5 million spine fractures occur annually in the United States alone resulting in over 17,730 spinal cord injuries annually. The most common site of spine fracture is the cervical spine. There has been a rise in the incidence of spinal fractures in the elderly and in this population, fractures can be more difficult to detect on imaging due to degenerative disease and osteoporosis. Imaging diagnosis of adult spine fractures is now almost exclusively performed with computed tomography (CT). Quickly detecting and determining the location of any vertebral fractures is essential to prevent ne...

Usage examples

The RSNA Cervical Spine Fracture CT Dataset by Ming, Hui Lin

See 1 usage example →

RSNA Intracranial Aneurysm Detection Dataset (RSNA-ICA)

computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology

The Radiological Society of North America Intracranial Aneurysm Detection (RSNA-ICA) dataset is a collection of over 4,000 CT brain scans annotated by a cohort of over 40 volunteer radiologists from RSNA and the American Society of Neuroradiology to show the presence and location of intracranial aneurysms. It also includes a set of about 200 imaging studies that are annotated with AI-generated segmentations highlighting abnormalities. The imaging data was provided by 18 institutions. Initially compiled in 2025 for the RSNA Intracranial Aneurysm Detection AI Challenge hosted on Kaggle competiti...

Usage examples

The RSNA Intercranial Aneurysm Detection Dataset by Authors, Various

See 1 usage example →

RSNA Intracranial Hemorrhage Detection

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

RSNA assembled this dataset in 2019 for the RSNA Intracranial Hemorrhage Detection AI Challenge (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/). De-identified head CT studies were provided by four research institutions. A group of over 60 volunteer expert radiologists recruited by RSNA and the American Society of Neuroradiology labeled over 25,000 exams for the presence and subtype classification of acute intracranial hemorrhage.

Usage examples

Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge by Rudie, Jeffrey D.

See 1 usage example →

RSNA Lumbar Spine Degenerative Classification Dataset (RSNA-LSDD)

computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology

The Radiological Society of North America Lumbar Spine Degenerative Classification dataset (RSNA-LSDD) is a collection of over 2,600 magnetic resonance imaging (MR) scans of the lumbar spine annotated by a cohort of about 60 volunteer radiologists recruited by the RSNA, the American Society for Spine Radiology and the American Society of Neuroradiology to identify the location and severity of five degenerative conditions across the five intervertebral disc levels (L1/L2, L2/L3, L3/L4, L4/L5, and L5/S1). The imaging data, comprising over 8,500 image series (Sagittal “T2”, Axial T2 and Sagittal ...

Usage examples

The RSNA Lumbar Spine Degenerative Classification Dataset by Authors, Various

See 1 usage example →

RSNA Pulmonary Embolism Detection

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

RSNA assembled this dataset in 2020 for the RSNA STR Pulmonary Embolism Detection AI Challenge (https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/). With more than 12,000 CT pulmonary angiography (CTPA) studies contributed by five international research centers, it is the largest publicly available annotated PE dataset. RSNA collaborated with the Society of Thoracic Radiology to recruit more than 80 expert thoracic radiologists who labeled the dataset with detailed clinical annotations.

Usage examples

The RSNA Pulmonary Embolism CT Dataset by Colak, Errol

See 1 usage example →

Seattle Alzheimer's Disease Brain Cell Atlas (SEA-AD)

biology, cell biology, cell imaging, epigenomics, gene expression, histopathology, Homo sapiens, imaging, life sciences, medicine, microscopy, neurobiology, neuroscience, single-cell transcriptomics, transcriptomics

The Seattle Alzheimer's Disease Brain Cell Atlas (SEA-AD) consortium strives to gain a deep molecular and cellular understanding of the early pathogenesis of Alzheimer's disease and is funded by the National Institute on Aging (NIA U19AG060909). The SEA-AD datasets available here comprise single cell profiling (transcriptomics and epigenomics) and quantitative neuropathology. To explore gene expression and chromatin accessibility information, the single-cell profiling data includes: snRNAseq and snATAC-seq data from the SEA-AD donor cohort (aged brains which span the spectrum of Alzhe...

Usage examples

Seattle Alzheimer’s Disease Brain Cell Atlas by Lein, E et al.

See 1 usage example →

Single-Cell Atlas of Human Blood During Healthy Aging

life sciences, protein, single-cell transcriptomics

Comprehensive, large-scale single-cell profiling of healthy human blood at different ages is one of the critical pending tasks required to establish a framework for systematic understanding of human aging. Here, using single-cell RNA/TCR/BCR-seq with protein feature barcoding (20 antibodies), we profiled 317 samples from 166 healthy individuals aged 25 to 85 years old drawn over a 3-year period. The dataset, spanning ~2 million cells, describes 50 subpopulations of blood immune cells, with 14 subpopulations changing with age, including a novel NKG2C+ CD8 Tcm population that decreases with age. We desc...

Usage examples

Single-cell atlas of healthy human blood unveils age-related loss of NKG2C+GZMB−CD8+ memory T cells and accumulation of type 2 memory T cells by Terekhova M, Swain A, Bohacova P, Aladyeva E, Arthur L, et al

See 1 usage example →

Synthea Coherent Data Set

bioinformatics, csv, dicom, genomic, health, imaging, life sciences, medicine

This is a synthetic data set that includes FHIR resources, DICOM images, genomic data, physiological data (i.e., ECGs), and simple clinical notes. FHIR links all the data types together.

Usage examples

The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive Synthetic Health Record. by Walonoski J, Hall D, Bates KM, Farris MH, Dagher J, Downs ME, Sivek RT, Wellner B, Gregorowicz A, Hadley M, Campion FX, Levine L, Wacome K, Emmer G, Kemmer A, Malik M, Hughes J, Granger E, Russell S.

See 1 usage example →

Tabula Muris

Biohub, biology, encyclopedic, genomic, health, life sciences, medicine

Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...

Usage examples

Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. by Tabula Muris Consortium (2019)

See 1 usage example →

Tabula Muris Senis

Biohub, biology, encyclopedic, genomic, health, life sciences, medicine, single-cell transcriptomics

Tabula Muris Senis is a comprehensive compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell type specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populat...

Usage examples

Fast queries of scRNAseq datasets with Amazon Athena by Andrew Ang, James Golden, Lisa McFerrin, and Lee Pang

See 1 usage example →

UCSF Primary Central Nervous System Lymphoma MRI Dataset

brain images, cancer, life sciences, magnetic resonance imaging, medical imaging, medicine, neuroimaging, radiology

This BIDS-formatted dataset provides multimodal brain MRI data from 150 patients with primary central nervous system lymphoma (PCNSL), including T1-weighted, contrast-enhanced T1-weighted, FLAIR, and ADC sequences. The dataset includes expert-annotated lesion segmentations with radiomic features, along with anonymized clinical data including demographics, diagnosis history, and medications.

Usage examples

PCNSL Data Access Tutorial by Michael Francis Romano

See 1 usage example →

UK Biobank Pharma Proteomics Project (UKB-PPP)

genome wide association study, life sciences, population genetics

The UKB-PPP is a collaboration between the UK Biobank (UKB) and thirteen biopharmaceutical companies characterising the plasma proteomic profiles of 54,219 UKB participants. As part of a collaborative analysis across the thirteen UKB-PPP partners, we conducted comprehensive protein quantitative trait loci (pQTL) mapping of 2,923 proteins that identifies 14,287 primary genetic associations, of which 85% are newly discovered, in addition to ancestry-specific pQTL mapping in non-Europeans. We identify independent secondary associations in 87% of cis and 30% of trans loci, expanding the catalogue ...

Usage examples

Plasma proteomic associations with genetics and health in the UK Biobank by Sun B, Chiou J, Traylor M, Benner C, Hsu Y, Richardson T, et al

See 1 usage example →

VitalDB

biology, health, life sciences, medicine, signal processing

VitalDB is a high-fidelity multi-parameter vital signs database in surgical patients.

Usage examples

VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients by Hyung-Chul Lee, Yoonsang Park, Soo Bin Yoon, Seong Mi Yang, Dongnyeok Park, and Chul-Woo Jung

See 1 usage example →

ZINC Database

biology, chemical biology, life sciences, molecular docking, pharmaceutical, protein

3D models for molecular docking screens.

Usage examples

ZINC Database by John Irwin

See 1 usage example →

iHART Whole Genome Sequencing Data Set

autism spectrum disorder, bam, genetic, genomic, life sciences, vcf, whole genome sequencing

iHART is the Hartwell Foundation’s Autism Research and Technology Initiative. This release contains whole genome data from over 1000 families with 2 or more children with autism; biomaterials were provided by the Autism Genetic Resource Exchange (AGRE).

Usage examples

Inherited and De Novo Genetic Risk for Autism Impacts Shared Networks by Ruzzo et al. (2020)

See 1 usage example →

recount3

bioinformatics, biology, cancer, csv, gene expression, genetic, genomic, Homo sapiens, life sciences, Mus musculus, neuroscience, transcriptomics

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.

Usage examples

recount3 quick start guide by Leonardo Collado-Torres

See 1 usage example →

Australasian Genomes

biodiversity, biology, conservation, genetic, genomic, life sciences, transcriptomics, wildlife

Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI) and the ARC Centre for Innovations in Peptide and Protein Science (CIPPS). This repository contains reference genomes, transcriptomes, resequenced genomes and reduced representation sequencing data from Australasian species. Australasian Genomes is managed by the Australasian Wildlife Genomics Group (AWGG) at the University of Sydney on behalf of our collaborators within TSI and CIPPS.

Baby Open Brains (BOBs) Repository on AWS

life sciences, magnetic resonance imaging, neuroimaging, neuroscience, nifti, pediatric, segmentation

Manually curated and reviewed infant brain segmentations and accompanying T1w and T2w images for a range of 1-9 month old participants from the Baby Connectome Project (BCP)

Usage examples

View or Download the BOBS Repository by Lucille A. Moore
BIBsnet by Timothy J. Hendrickson et al.
BIBSNet: A Deep Learning Baby Image Brain Segmentation Network for MRI Scans by Timothy J. Hendrickson et al.

See 3 usage examples →

BioLiP

bioinformatics, chemistry, life sciences, molecular docking, molecule, protein, structural biology

BioLiP is a semi-manually curated database for high-quality, biologically relevant ligand-protein binding interactions. The structure data are collected primarily from the Protein Data Bank (PDB), with biological insights mined from literature and other specific databases. BioLiP aims to construct the most comprehensive and accurate database for serving the needs of ligand-protein docking, virtual ligand screening and protein function annotation.

Usage examples

BioLiP API usage by Zhang Lab
BioLiP2: an updated structure database for biologically relevant ligand-protein interactions by Chengxin Zhang, Xi Zhang, Peter L Freddolino, and Yang Zhang
BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions by Jianyi Yang, Ambrish Roy, and Yang Zhang

See 3 usage examples →

Brain Data Science Database 1

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

This collection unifies multiple brain datasets spanning critical care, sleep medicine, cardiopulmonary health, infectious diseases, and other aspects of clinical neuroscience. It includes a variety of types of clinical neuroscience data including electroencephalography (EEG) and polysomnography (PSG) recordings, and supporting data to enable research in diverse areas of clinical neuroscience such as epilepsy, delirium, coma, and sleep medicine. All data is de-identified and includes code to reproduce results in accompanying research publications. The data is available for non-commercial resea...

Brain Data Science Database 2

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

This collection unifies multiple brain datasets spanning critical care, sleep medicine, cardiopulmonary health, infectious diseases, and other aspects of clinical neuroscience. It includes large-scale electroencephalography (EEG) and polysomnography (PSG) recordings, brain imaging data (MRI, CT, PET), and supporting data to enable research in diverse areas of clinical neuroscience such as epilepsy, delirium, coma, sleep depth, sleep-related breathing disorders, meditation, subarachnoid hemorrhage, cardiac arrest, neuroinfectious diseases, and audiology. All data is de-identified and includes a...

Brain Data Science Database 3

bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience

This collection unifies multiple brain datasets spanning critical care, sleep medicine, cardiopulmonary health, infectious diseases, and other aspects of clinical neuroscience. It includes large-scale electroencephalography (EEG) and polysomnography (PSG) recordings, brain imaging data (MRI, CT, PET), and electronic health records (EHR) data supporting research in areas such as epilepsy, delirium, coma, sleep depth, sleep-related breathing disorders, burst suppression, meditation, subarachnoid hemorrhage, cardiac arrest, and neuroinfectious diseases. All data is de-identified and includes algo...

COVID-19 Molecular Structure and Therapeutics Hub

bioinformatics, biology, coronavirus, COVID-19, life sciences, molecular docking, pharmaceutical

Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community. A community-driven data repository and curation service for molecular structures, models, therapeutics, and simulations supporting computational research on therapeutic opportunities for COVID-19 (caused by the SARS-CoV-2 coronavirus).

CartoStore

bioinformatics, genomic, life sciences, spatial omics, spatial transcriptomics

Cross-Platform Repository for High-resolution Spatial Transcriptomics Datasets.

Usage examples

Example CartoStore Repository for Xenium Breast Cancer Dataset by Hyun Min Kang and Weiqiu Cheng
CartoStore Overview by Hyun Min Kang and Weiqiu Cheng
Cartloader Documentation by Hyun Min Kang and Weiqiu Cheng

See 3 usage examples →

Clinical Ultrasound Image Repository

life sciences, machine learning, medical imaging, medicine

Generic clinical ultrasound data from random subjects, acquired for clinical reasons, to be used for developing artificial intelligence applications. This dataset is complete with 2000 studies from 2000 subjects (one third each from abdominal, cardiac, and OB/GYN cases).

DHARANI Developing Human-Brain Atlas

brain images, computer vision, life sciences, microscopy, neurobiology, segmentation

We introduce DHARANI, the first online platform with three-dimensional (3D) histological reconstructions of the developing human brain from 14 to 24 gestational weeks (GW) across five fetal brains. DHARANI features 5132 Nissl, hematoxylin and eosin stained, 20 µm coronal and sagittal sections, postmortem MRI, and a neuroanatomical atlas with 466 annotated sections covering ∼500 brain structures. It is accessible online at https://brainportal.humanbrain.in/publicview/index.html. The 3D reconstruction enables a volumetric view of the fetal brain, allowing visualization in all three planes ak...

Usage examples

See 3 usage examples →

ENHANCE.PET 1.6k - Whole-/Total-Body [18F]FDG-PET/CT with CT-Derived Segmentations

cancer, life sciences, medical imaging, nifti, radiology, segmentation

Open, multi-center dataset of 1,597 whole-/total-body FDG-PET/CT studies with 130 CT-derived, expert-verified anatomical segmentations per scan (~250 GB). Provided as anonymized NIfTI (PET, CT, labels) with spreadsheet metadata. Designed for segmentation benchmarking, multi-organ analysis, radiomics, and PET/CT AI research.

Usage examples

See 3 usage examples →

GATK Test Data

bioinformatics, biology, cancer, genetic, genomic, life sciences

The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute's Genome Analysis Toolkit (GATK).

GX database for NCBI Foreign Contamination Screen (FCS) Tool Suite

assembly, bioinformatics, biology, contamination, fasta, genetic, genome, health, life sciences, STRIDES, whole genome sequencing

Sequence database used by FCS-GX (Foreign Contamination Screen - Genome Cross-species aligner) to detect contamination from foreign organisms in genome sequences.

Genome Ark

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.

Gulfwide Avian Colony Monitoring Survey Photos

biology, conservation, ecosystems, environmental, labeled, life sciences, object detection

For this project, The Water Institute (the Institute) and subcontractor Colibri Ecological Consulting, LLC (Colibri) utilized established methods and protocols capable of assessing changes of colonial waterbird populations and their important habitats within individual states and the broader northern Gulf of Mexico region. Data collection activities included: Aerial Photographic Nest Surveys: Implementation of fixed-wing aircraft surveys intended to assess waterbird colonies and document associated nesting within select portions of the northern Gulf of Mexico. Additional detail is provide...

Guy's Breast Cancer Lymph Nodes (GRAPE)

biology, breast cancer, cancer, computational pathology, histopathology, life sciences

This is a retrospective dataset of 1523 H&E-stained whole slide images (WSI) of lymph nodes from breast cancer patients. The cohort consisted of 177 patients (122 LN-positive - metastasis was reported in at least 1 LN - and 55 LN-negative patients) with invasive breast carcinoma treated between 1984 and 2002 at Guy’s Hospital London, UK. Slides were scanned and digitised at 40x magnification (0.23 µm/pixel), NanoZoomer H.T2.0 2.0-HT (Hamamatsu Photonics UK, Ltd, Welwyn Garden City, UK). WSIs are in .ndpi format.

Imaging BSD licensed data and models

biodiversity, Biohub, bioinformatics, biology, biomolecular modeling, brain images, cell biology, cell imaging, imaging, life sciences, machine learning, microscopy, model, protein, zarr

This dataset contains a diverse range of biological imaging data and models. The data is sourced and curated by a team of experts at Biohub and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

Quickstart Tutorial for Cytoland by Biohub
Cytoland: robust virtual staining of landmark organelles by Liu, Hirata-Miyasaki, et al.
Documentation for Cytoland by Biohub

See 3 usage examples →

InRad COVID-19 X-Ray and CT Scans

bioinformatics, coronavirus, COVID-19, health, life sciences, medicine, SARS

This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV-2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury.

MIMIC-IV Clinical Database Demo

health, life sciences, medicine

The Medical Information Mart for Intensive Care (MIMIC)-IV database is comprised of deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly-available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether the MIMIC-IV is appropriate for a study before making an access r...

MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset

health, life sciences, medicine

The MIMIC-IV-ECG module contains approximately 800,000 diagnostic electrocardiograms across nearly 160,000 unique patients. These diagnostic ECGs use 12 leads and are 10 seconds in length. They are sampled at 500 Hz. This subset contains all of the ECGs for patients who appear in the MIMIC-IV Clinical Database. When a cardiologist report is available for a given ECG, we provide the needed information to link the waveform to the report. The patients in MIMIC-IV-ECG have been matched against the MIMIC-IV Clinical Database, making it possible to link to information across the MIMIC-IV modules.
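The recording parameters above determine the size of each waveform. As a rough sketch of that arithmetic (the constant and function names are illustrative and are not part of the dataset's own tooling), the sample count per ECG follows from the lead count, duration, and sampling rate:

```python
# Recording parameters stated in the dataset description.
NUM_LEADS = 12
DURATION_S = 10
SAMPLING_RATE_HZ = 500


def samples_per_lead(duration_s: int, rate_hz: int) -> int:
    """Number of samples recorded on a single lead."""
    return duration_s * rate_hz


def samples_per_ecg(num_leads: int, duration_s: int, rate_hz: int) -> int:
    """Total number of samples across all leads of one ECG."""
    return num_leads * samples_per_lead(duration_s, rate_hz)


per_lead = samples_per_lead(DURATION_S, SAMPLING_RATE_HZ)
per_ecg = samples_per_ecg(NUM_LEADS, DURATION_S, SAMPLING_RATE_HZ)
print(per_lead, per_ecg)  # 5000 60000
```

Each ECG can therefore be thought of as a 12 x 5000 array of samples, assuming one value per lead per time step.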

MetaGraph Sequence Indexes

analysis ready data, biodiversity, bioinformatics, biology, fasta, genome, genomic, graph, information retrieval, life sciences, medicine, metagenomics, microbiome, transcriptomics, whole exome sequencing, whole genome sequencing

The MetaGraph Sequence Indexes dataset comprises full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). All index files can be used with the MetaGraph framework for sequence search. Indexes can be jointly used for aggregated search in the cloud or can be individually downloaded...

Usage examples

Usage within AWS by Oleksandr Kulkov
CloudFormation stack with a Step Function for dataset queries via AWS Batch by Oleksandr Kulkov
A global metagenomic map of urban microbiomes and antimicrobial resistance by Danko D, Bezdan D, Afshin EE, Ahsanuddin S, Bhattacharya C, Butler DJ, Chng KE, Donnellan D, Hecht J, Jackson K, Kuchin K, Karasikov M, Lyons A, Mak L, Meleshko D, Mustafa H, et al.

See 3 usage examples →

Metagenomic reference libraries for Slacken

bioinformatics, biology, genomic, life sciences, metagenomics, microbiome

Metagenomic indexes for use with the Slacken taxonomic classification tool

Usage examples

Precise and scalable metagenomic profiling with sample-tailored minimizer libraries by Johan Nyström-Persson, Nishad Bapatdhar and Samik Ghosh
Classifying metagenomic samples on AWS ElasticMapReduce by Johan Nyström-Persson
Slacken by Johan Nyström-Persson, Nishad Bapatdhar

See 3 usage examples →

Model Benchmarking

benchmark, Biohub, biology, biomolecular modeling, cell biology, life sciences, machine learning, model

This dataset includes data and models relevant to benchmarking multimodal biological models. The data has been sourced and curated by a team of experts at Biohub and is provided as part of these datasets only when it is not publicly available or requires transformation to support effective model benchmarking.

Usage examples

Evaluating SubCell and Related Imaging Models by Biohub
The molecular evolution of spermatogenesis across mammals by Murat, F., et al.
Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues by Tabula Sapiens Consortium et al.

See 3 usage examples →

Nanopore Reference Human Genome

genetic, genomic, life sciences, whole genome sequencing

This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.

Natural Scenes Dataset

computer vision, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, neuroimaging, neuroscience, nifti

Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retin...

OpenWings OpenData

biodiversity, fastq, genetic, genome, life sciences, museum, wildlife

DNA sequence data of UCE loci collected from the world's bird species (n=10,560).

Usage examples

Ultraconserved elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales by BC Faircloth, JE McCormack, NG Crawford, MG Harvey, RT Brumfield, TC Glenn
Tutorial I - UCE Phylogenomics by Brant Faircloth
phyluce by Brant Faircloth

See 3 usage examples →

Opioid Industry Documents Archive (OIDA) Data on AWS

archives, life sciences, pharmaceutical, text analysis, txt

The OIDA Data on AWS contain the metadata, documents, and extracted text for all of the documents in the UCSF-JHU Opioid Industry Documents Archive, a growing corpus of internal corporate records and other documents arising from the opioid industry.

PhysioNet

biology, life sciences

PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).

RSNA Screening Mammography Breast Cancer Detection (RSNA-SMBC) Dataset

breast cancer, cancer, computer vision, csv, labeled, life sciences, machine learning, mammography, medical image computing, medical imaging, radiology

According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requi...

SocialGene RefSeq Databases

amino acid, bioinformatics, chemical biology, genomic, graph, life sciences, metagenomics, microbiome, pharmaceutical, protein

Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.

Usage examples

See 3 usage examples →

The Genome Modeling System

genetic, genomic, life sciences

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

The University of California San Francisco Brain Metastases Stereotactic Radiosurgery (UCSF-BMSR) MRI Dataset

cancer, life sciences, magnetic resonance imaging, medical imaging, medicine, radiology

The University of California San Francisco Brain Metastases Stereotactic Radiosurgery (UCSF-BMSR) dataset is a public, clinical, multimodal brain MRI dataset consisting of 560 brain MRIs from 412 patients with expert annotations of 5136 brain metastases. Data consists of registered and skull-stripped T1 post-contrast, T1 pre-contrast, FLAIR and subtraction (T1 pre-contrast - T1 post-contrast) images and voxelwise segmentations of enhancing brain metastases in NIfTI format.

UCSC Genome Browser Sequence and Annotations

bioinformatics, biology, genetic, genomic, life sciences

The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-separated format and all binary files in custom formats, sometimes referred to as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome...

USearch Molecules

biology, chemical biology, life sciences, pharmaceutical

Collection of 7 billion small molecules in SMILES notation with 28 billion fingerprints, including MACCS, ECFP4, FCFP4, and PubChem, with pre-constructed USearch indexes over them.
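Binary fingerprints such as MACCS and ECFP4 are commonly compared with Tanimoto (Jaccard) similarity. The sketch below illustrates that measure on two made-up 8-bit fingerprints stored as Python integers; it is an illustration of the general technique, not the dataset's or USearch's own implementation, and the fingerprint values are hypothetical.

```python
# Counts stated in the dataset description.
NUM_MOLECULES = 7_000_000_000
NUM_FINGERPRINTS = 28_000_000_000

# Four fingerprint types are listed, which matches the ratio of the two counts.
fingerprints_per_molecule = NUM_FINGERPRINTS // NUM_MOLECULES


def tanimoto(a: int, b: int) -> float:
    """Tanimoto (Jaccard) similarity of two bit fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return bin(a & b).count("1") / union


# Hypothetical 8-bit fingerprints; real ones are hundreds to thousands of bits.
fp_a = 0b10110100
fp_b = 0b10010110

similarity = tanimoto(fp_a, fp_b)  # 3 shared bits out of 5 set in either
print(fingerprints_per_molecule, similarity)
```

A pre-constructed index spares users from computing such similarities exhaustively across billions of fingerprints.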

University of British Columbia Sunflower Genome Dataset

agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing

This dataset captures the genetic diversity of sunflowers, originating from thousands of wild, cultivated, and landrace sunflower individuals distributed across North America. The data consists of raw sequences and associated botanical metadata, aligned sequences (to three different reference genomes), and sets of SNPs computed across several cohorts.

Will Two Do? Varying Dimensions in Electrocardiography: The PhysioNet/Computing in Cardiology Challenge 2021

health, life sciences, medicine

The electrocardiogram (ECG) is a non-invasive representation of the electrical activity of the heart. Although the twelve-lead ECG is the standard diagnostic screening system for many cardiological issues, the limited accessibility of twelve-lead ECG devices provides a rationale for smaller, lower-cost, and easier to use devices. While single-lead ECGs are limiting [1], reduced-lead ECG systems hold promise, with evidence that subsets of the standard twelve leads can capture useful information [2], [3], [4] and even be comparable to twelve-lead ECGs in some limited contexts. In 2017 we challen...

iNaturalist Licensed Observation Images

biodiversity, bioinformatics, conservation, earth observation, life sciences

iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.

stdpopsim species resources

genetic maps, life sciences, population genetics, recombination maps, simulations

Contains all resources (genome specifications, recombination maps, etc.) required for species-specific simulation with the stdpopsim package. These resources originally come from a variety of consortia and published work but are consolidated here for ease of access and use. If you are interested in adding a new species to the stdpopsim resource, please raise an issue on the stdpopsim GitHub page to have the necessary files added here.

GenomeKit genomic data

bioinformatics, genome, genomic, Homo sapiens, life sciences, Mus musculus, non-human primate, open source software, Rattus norvegicus, variant annotation

GenomeKit is Deep Genomics’ Python library for fast and easy access to genomic resources such as sequence, data tracks, and annotations. The goal is to let machine learning researchers build data sets easily, and to be creative about how those data sets are designed. Out of the box, GenomeKit provides access to pre-built optimized genomic data files that are required for its operation.

Usage examples

See 2 usage examples →

NASA SOTERIA Simulation Testbed Data

life sciences, neuroimaging, transportation, workload analysis

Commercial pilot simulation data during safety-of-flight scenarios.

Usage examples

Python Processing Code by Tyler Fettrow
SOTERIA Simulation - Experimental Methods, Data Processing, and Data Quality by Tyler Fettrow, Chad Stephens, Lance Prinzel, Jon Holbrook, Sepher Bastami, Michael Stewart, Kathryn Ballard, Daniel Kiggins

See 2 usage examples →

Platinum Pedigree

bioinformatics, genomic, genotyping, Homo sapiens, life sciences, long read sequencing, whole genome sequencing

The Platinum Pedigree Consortium (PCC) is a collaborative project to create a comprehensive reference for human genetic variation using a four-generation, 28-member family (CEPH-1463). We employed five different short and long-read sequencing technologies to generate phased assemblies and characterize both inherited and de novo variation, including at some of the most difficult to genotype genomic regions such as tandem repeats, centromeres, and the Y chromosome. This extensive "truth set" is publicly available and can be used to test and benchmark new algorithms and technologies to ...

Usage examples

See 2 usage examples →

1KG-ONT-VIENNA panel

fast5, fastq, genetic, genomic, life sciences, whole genome sequencing

The 1KG-ONT-VIENNA panel comprises medium-coverage ONT sequencing data for 1,019 samples from the 1000 Genomes Project collection, structural variants, and their haplotype context.

Usage examples

Long-read sequencing and structural variant characterization in 1,019 samples from the 1000 Genomes Project by Siegfried Schloissnig, Samarendra Pani, Bernardo Rodriguez-Martin, Jana Ebler, Carsten Hain, Vasiliki Tsapalou, Arda Söylev, Patrick Hüther, Hufsah Ashraf, Timofey Prodanov, Mila Asparuhova, Sarah Hunt, Tobias Rausch, Tobias Marschall, Jan O Korbel

See 1 usage example →

AWS iGenomes

agriculture, amazon.science, biology, Caenorhabditis elegans, Danio rerio, genetic, genomic, Homo sapiens, life sciences, Mus musculus, Rattus norvegicus, reference index

Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data.

Usage examples

nf-core analysis pipelines by Phil Ewels

See 1 usage example →

AllTheBacteria

assembly, bacteria, bioinformatics, fasta, genomic, life sciences, microbial genomics, short read sequencing, whole genome sequencing

All bacterial isolate whole-genome sequencing data from INSDC, uniformly assembled, quality-controlled, annotated, and searchable.

Usage examples

AllTheBacteria - all bacterial genomes assembled, available and searchable by Hunt M, Lima L, Anderson D, Hawkey J, Shen W, Lees J, Iqbal I

See 1 usage example →

Boltz-1 Training Data

deep learning, life sciences, molecular docking, open source software, protein folding

This is the data used to train the Boltz-1 model. It contains the following datasets:

Our pre-processed version of the Protein Data Bank
Our pre-processed version of the multiple sequence alignment data for each protein chain
The raw multiple sequence alignment data
A pre-computed symmetry file for symmetry correction during training

Usage examples

Boltz-1: Democratizing Biomolecular Interaction Modeling by J Wohlwend, G Corso, S Passaro, M Reveiz, K Leidal, W Swiderski, T Portnoi, I Chinn, J Silterra, T Jaakkola, R Barzilay

See 1 usage example →

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

amazon.science, bioinformatics, fastq, genetic, genomic, life sciences, long read sequencing, short read sequencing, whole exome sequencing, whole genome sequencing

To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets on different sequencing instruments, sequencing modalities (Illumina short read and Pacific Biosciences long read), sample preparation protocols, and for whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.

Usage examples

An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development by Baid G., Nattestad M., Kolesnikov A., Goel S., Yang H., Chang P., and Carroll A (2020)

See 1 usage example →

Ocean Biodiversity Information System (OBIS) species occurrence data

biodiversity, coastal, conservation, ecosystems, environmental, geospatial, life sciences, oceans, water

The Ocean Biodiversity Information System (OBIS) was founded in 2000 under the Census of Marine Life. It is now a programme component of the International Oceanographic Data and Information Exchange (IODE) programme of the Intergovernmental Oceanographic Commission (IOC) of UNESCO. OBIS aims to be the most comprehensive data and information gateway on the diversity, distribution and abundance of marine life to support its Member States in achieving a healthy and resilient ocean ecosystem. The OBIS network consists of over 30 regional and thematic nodes, and provides access to more than 5,000 d...

Usage examples

Querying OBIS occurrence data using Amazon Athena by Pieter Provoost

See 1 usage example →

OceanOmics

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

Minderoo Foundation OceanOmics aims to establish environmental DNA (eDNA) as a tool to measure, understand, and protect oceans. OceanOmics mainly generates two types of data: eDNA sequencing data (metabarcoding, metagenomics), and genome assembly data (marine vertebrates).

Usage examples

Case-studies on using OceanOmics genomes and eDNA data by Philipp Bayer

See 1 usage example →

Blue Brain Open Data

brain images, brain models, electrophysiology, ion channels, life sciences, microcircuit modeling and simulation, morphological reconstructions, Mus musculus, neuroscience, simulation neuroscience, single neuron models

The Blue Brain Open Data represents an extensive neuroscience dataset encompassing a diverse range of data types, including experimental, model, and simulation data, along with images and videos depicting reconstructed neurons and brain regions.