Amazon Web Services is collaborating with Chan Zuckerberg Biohub to share cell biology datasets. These datasets help scientists study how cells operate, organize, and work as parts of larger systems, with the goal of understanding why disease happens and how to correct it.
If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.
Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.
If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.
bioinformatics, cell biology, life sciences, single-cell transcriptomics, transcriptomics
CZ CELLxGENE Discover (cellxgene.cziscience.com) is a free-to-use platform for the exploration, analysis, and retrieval of single-cell data. CZ CELLxGENE Discover hosts the largest aggregation of standardized single-cell data from the major human and mouse tissues, with modalities that include gene expression, chromatin accessibility, DNA methylation, and spatial transcriptomics. This year, CZ CELLxGENE Discover has made available all of its human and mouse RNA single-cell data through Census (https://chanzuckerberg.github.io/cellxgene-census/), a free-to-use service with an API and data that...
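As a rough illustration of what querying Census through its API can look like, here is a minimal Python sketch. It assumes the `cellxgene_census` package is installed and that the obs columns `tissue_general`, `cell_type`, and `is_primary_data` are present in the Census release being queried. The helper names `build_obs_filter` and `fetch_cells` are hypothetical and exist only for this example. Consult the Census documentation for the current schema before relying on it.

```python
def build_obs_filter(tissue: str, cell_type: str, primary_only: bool = True) -> str:
    """Build a Census-style obs filter expression (hypothetical helper).

    The column names used here are assumptions about the Census obs schema.
    """
    clauses = [
        f"tissue_general == '{tissue}'",
        f"cell_type == '{cell_type}'",
    ]
    if primary_only:
        # Restrict to primary data to avoid counting duplicated cells.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


def fetch_cells(tissue: str, cell_type: str):
    """Fetch matching human cells as an AnnData object (requires network access)."""
    # Imported lazily so the filter helper can be used without the package.
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=build_obs_filter(tissue, cell_type),
        )


if __name__ == "__main__":
    # Only prints the filter string; call fetch_cells(...) to download data.
    print(build_obs_filter("lung", "B cell"))
```

Keeping the filter construction separate from the download makes it easy to inspect the query before pulling what may be a large slice of data.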
biology, cell biology, cell imaging, computer vision, fluorescence imaging, imaging, life sciences, machine learning, microscopy
The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.
biodiversity, bioinformatics, biology, biomolecular modeling, brain images, cell biology, cell imaging, czi, imaging, life sciences, machine learning, microscopy, model, protein, zarr
This dataset contains a diverse range of biological imaging data and models. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.
biodiversity, biology, biomolecular modeling, cell biology, czi, hdf5, life sciences, machine learning, model, protein, transcriptomics
This dataset contains transcriptomic biological data and models. The models embed transcriptomic data and facilitate transcriptomic analysis. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.
cell biology, cryo electron tomography, czi, electron tomography, life sciences, machine learning, segmentation, structural biology
Cryo-electron tomography (cryoET) is a powerful technique for visualizing 3D structures of cellular macromolecules at near-atomic resolution in their native environment. Observing the inner workings of cells in context enables a better understanding of the function of healthy cells and the changes associated with disease. However, the analysis of cryoET data remains a significant bottleneck, particularly the annotation of macromolecules within a set of tomograms, which often requires a laborious and time-consuming process of manual labelling that can take months to complete. Given the current...
biology, encyclopedic, genetic, genomic, health, life sciences, medicine, single-cell transcriptomics
Tabula Sapiens is a benchmark, first-draft human cell atlas of over 1.1M cells from 28 organs of 24 normal human subjects. This work is the product of the Tabula Sapiens Consortium. Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects, and allows detailed analysis and comparison of cell types that are shared between tissues. Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments. We...
biology, encyclopedic, genomic, health, life sciences, medicine
Tabula Muris is a compendium of single-cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...
biology, encyclopedic, genomic, health, life sciences, medicine, single-cell transcriptomics
Tabula Muris Senis is a comprehensive compendium of single-cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age-related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell-type-specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populat...
biodiversity, bioinformatics, biology, biomolecular modeling, brain images, cell biology, cell imaging, czi, imaging, life sciences, machine learning, microscopy, model, protein, zarr
This dataset contains a diverse range of biological imaging data and models. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.
benchmark, biology, biomolecular modeling, cell biology, czi, life sciences, machine learning, model
This dataset includes data and models relevant to benchmarking multimodal biological models. The data has been sourced and curated by a team of experts at CZI and is provided as part of these datasets only when it is not publicly available or requires transformation to support effective model benchmarking.