Registry of Open Data on AWS

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with genetic.

You are currently viewing a subset of data tagged with genetic.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, and 4.2

bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing

This dataset contains alignment files and short nucleotide, copy number (CNV), repeat expansion (STR), structural variant (SV) and other variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, and v4.2.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information). The v3.7.6 and v4.2.7 datasets include results from trio small variant, de novo structural vari...

Usage examples

Illumina Connected Analytics by Illumina Inc.
DRAGEN Sets New Standard for Data Accuracy in PrecisionFDA Benchmark Data. Optimizing Variant Calling Performance with Illumina Machine Learning and DRAGEN Graph by Illumina Inc. (2022)
Nirvana Documentation by Illumina Inc.
Overcoming high homology to detect variation in CYP21A2 with whole-genome sequencing in DRAGEN by Illumina Inc. (2023)
Demystifying the versions of GRCh38/hg38 reference genomes, how they are used in DRAGEN and their impact on accuracy by Illumina Inc. (2021)

See 22 usage examples →

Gabriella Miller Kids First Pediatric Research Program (Kids First)

cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing

The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...

Usage examples

Clinically Relevant and Minimally Invasive Tumor Surveillance of Pediatric Diffuse Midline Gliomas Using Patient-Derived Liquid Biopsy by Eshini Panditharatna, Lindsay B Kilburn, et al.
Development and Clinical Validation of a Large Fusion Gene Panel for Pediatric Cancers. by Fengqi Chang, Fumin Lin, et al.
Elucidation of de novo small insertion/deletion biology with parent-of-origin phasing. by Allison H Seiden, Felix Richter, et al.
Genomic Analyses Implicate Noncoding De Novo Variants in Congenital Heart Disease. by Felix Richter, Sarah U Morton, et al.
Decreased ACKR3 (CXCR7) function causes oculomotor synkinesis in mice and humans. by Mary C Whitman, Noriko Miyake, et al.

See 19 usage examples →

Cell Painting Gallery

bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle

The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad Institute. A human-friendly listing of datasets, instructions for accessing them, and other documentation is at the corresponding GitHub page abou...

Usage examples

Image-based Profiling Handbook - for processing image-based profiling datasets using CellProfiler and pycytominer by Multiple Authors
Image-based profiling introductory exercise - data and an exercise on exploring image-based profiles, including understanding the various data levels by Beth Cimini
Systematic morphological profiling of human gene and allele function via Cell Painting by Rohban MH, Singh S, Wu X, Berthet JB, Bray M-A, Shrestha Y, Varelas X, Boehm JS, & Carpenter AE
Multiplex Cytological Profiling Assay to Measure Diverse Cellular States by Gustafsdottir SM, Ljosa V, Sokolnicki KL, Wilson JA, Walpita D, Kemp MM, Seiler KP, Carrel HA, Golub TR, Schreiber SL, Clemons PA, Carpenter AE, and Shamji AF
Accelerating Drug Discovery with high-throughput Cell Painting on AWS by Chris Kaspar

See 17 usage examples →

Genome Aggregation Database (gnomAD)

bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v4.1 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries, sequenced as part of various disease-specific and population genetic studies. The gnomAD Principal Investigators and team can be found here, and the groups that have contributed data to the current release are listed here. Sign up for the gnom...

Usage examples

Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016) by Lek, M., Karczewski, K., Minikel, E. et al.
gnomAD v3.0 by Laurent Francioli, Daniel MacArthur
Evaluating potential drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (2020) by Minikel, E. V., Karczewski, K. J., Martin, H. C., Cummings, B. B., Whiffin, N., Rhodes, D., Alföldi, J., Trembath, R. C., van Heel, D. A., Daly, M. J., Genome Aggregation Database Production Team, Genome Aggregation Database Consortium, Schreiber, S. L., & MacArthur, D. G.
The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020) by Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., Gauthier, L. D., Brand, H., Solomonson, M., Watts, N. A., Rhodes, D., Singer-Berk, M., England, E. M., Seaby, E. G., Kosmicki, J. A., ... MacArthur, D. G.
gnomAD v2.1 by Laurent Francioli, Grace Tiao, Konrad Karczewski, Matthew Solomonson, Nick Watts

See 15 usage examples →

PubSeq - Public Sequence Resource

bam, bioinformatics, biology, coronavirus, COVID-19, fast5, fasta, fastq, genetic, genomic, health, json, life sciences, long read sequencing, medicine, MERS, metadata, open source software, RDF, SARS, SARS-CoV-2, SPARQL

COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.

Usage examples

See 9 usage examples →

Cancer Cell Line Encyclopedia (CCLE)

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES, transcriptomics, whole genome sequencing

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.

Usage examples

See 8 usage examples →

NIH Roadmap Epigenomics

bioinformatics, biology, epigenomics, genetic, genomic, life sciences

The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The project has generated high-quality, genome-wide maps of several key histone modifications, chromatin accessibility, DNA methylation and mRNA expression across hundreds of human cell types and tissues. To see what data are available, please check the directory listing: https://roadmapepigenomics.s3.us-west-2.amazonaws.com/index.html.

Usage examples

Navigation of Roadmap data using Roadmap web portal by Anshul Kundaje Lab
WashU Epigenome Browser update 2019 by Daofeng Li, Silas Hsu, Deepak Purushotham, Renee L Sears and Ting Wang
Human body epigenome maps reveal noncanonical DNA methylation variation by Matthew D. Schultz, Yupeng He, John W. Whitaker, Manoj Hariharan, Eran A. Mukamel, Danny Leung, Nisha Rajagopal, Joseph R. Nery, Mark A. Urich, Huaming Chen, Shin Lin, Yiing Lin, Inkyung Jung, Anthony D. Schmitt, Siddarth Selvaraj, Bing Ren, Terrence J. Sejnowski, Wei Wang & Joseph R. Ecker
Integrative analysis of 111 reference human epigenomes by Roadmap Epigenomics Consortium, Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour et al., Ting Wang, Manolis Kellis
The epigenomic landscape of transposable elements across normal human development and anatomy by Erica C. Pehrsson, Mayank N. K. Choudhary, Vasavi Sundaram & Ting Wang

See 8 usage examples →

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

bioinformatics, biology, environmental, epigenomics, genetic, genomic, life sciences

The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.

Usage examples

Metabolic effects of air pollution exposure and reversibility by Rajagopalan S, Park B, Palanivel R, et al.
The role of environmental exposures and the epigenome in health and disease. by Perera BPU, Faulk C, Svoboda LK, Goodrich JM, Dolinoy DC.
Epigenetic biomarkers and preterm birth by Park B, Khanam R, Vinayachandran V, et al.
Visualize TaRGET II data with WashU Epigenome Browser by WashU Epigenome Browser
Environmental Determinants of cardiovascular disease: lessons learned from air pollution by Al-Kindi SG, Brook RD, Biswal S, Rajagopalan S.

See 8 usage examples →

CIViC (Clinical Interpretation of Variants in Cancer)

cancer, genetic, genomic, life sciences, vcf

Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...

Usage examples

See 7 usage examples →

ICGC on AWS

bam, cancer, genetic, genomic, life sciences, vcf

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Usage examples

See 7 usage examples →

Logan Unitigs and Contigs of the Sequence Read Archive (SRA) on AWS

fasta, genetic, genomic, life sciences, metagenomics, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve near...

Usage examples

See 7 usage examples →

Open Bioinformatics Reference Data for Galaxy

bioinformatics, biology, genetic, genomic, life sciences, reference index

This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easily incorporable with a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...

Usage examples

Using Open Bio Ref Data with Galaxy and Bioconductor by Enis Afgan, Alexandru Mahmoud, Nuwan Goonasekera
Bioconductor by Bioconductor Project
Accessible, curated metagenomic data through ExperimentHub by Edoardo Pasolli, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd, Curtis Huttenhower, Martin Morgan, Nicola Segata, and Levi Waldron
TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages by Tiago C. Silva, Antonio Colaprico, Catharina Olsen, Fulvio D'Angelo, Gianluca Bontempi, Michele Ceccarelli, Houtan Noushmehr
Wrangling Galaxy's reference data by Daniel Blankenberg, James E. Johnson, The Galaxy Team, James Taylor, Anton Nekrutenko

See 6 usage examples →

Serratus: Ultra-deep Search for Novel Viruses - Versioned Data Release

bam, COVID-19, genetic, genomic, life sciences, MERS, SARS, SARS-CoV-2, virus

Serratus is a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses in response to the COVID-19 pandemic through re-analysis of publicly available genomic data. Our resulting vertebrate viral alignment data is explorable via the Serratus Explorer and directly accessible on Amazon S3.

Usage examples

Ribovirus classification by a polymerase barcode sequence by Babaian A., and Edgar R. (2021)
Serratus Explorer by Serratus Team
coronaSPAdes. From biosynthetic gene clusters to RNA viral assemblies by Meleshko D., Hajirasouliha I., and Korobeynikov A. (2021)
Tantalus: An R Package for exploration of Serratus data by Serratus Team
Diversification of mammalian deltaviruses by host shifting by Bergner L.M., Orton R.J., et al (2021)

See 6 usage examples →

3000 Rice Genomes Project

agriculture, food security, genetic, genomic, life sciences

The 3000 Rice Genomes Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.

Usage examples

Identification and Allele Combination Analysis of Rice Grain Shape-Related Genes by Genome-Wide Association Study by Meng B et al (2022)
Structural variants in 3000 rice genomes by Fuentes RR et al (2019)
RiceGalaxy by International Rice Research Institute
Rice Galaxy: an open resource for plant science by Juanillas V et al (2019)
Tracking the origin of two genetic components associated with transposable element bursts in domesticated rice by Chen J et al (2019)

See 5 usage examples →

CoMMpass from the Multiple Myeloma Research Foundation

cancer, genetic, genomic, life sciences, STRIDES, whole genome sequencing

The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...

Usage examples

"Interim Analysis of the Mmrf Commpass Trial: Identification of Novel Rearrangements Potentially Associated with Disease Initiation and Progression" by Sagar Lonial, MD, Venkata D Yellapantula, Winnie Liang, PhD, Ahmet Kurdoglu, BS, Jessica Aldrich, MSc, Christophe M. Legendre, MD, Kristi Stephenson, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Megan Russell, Austin Christofferson, Lori Cuyugan, Dan Rohrer, Alex Blanski, Meghan Hodges, Mmrf CoMMpass Network, Mary Derome, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, David Craig, PhD, John Carpten, PhD, Jonathan J. Keats, PhD
Genomic Data Commons by National Cancer Institute
"Interim Analysis Of The MMRF CoMMpass Trial: a Longitudinal Study In Multiple Myeloma Relating Clinical Outcomes To Genomic and Immunophenotypic Profiles" by Keats JJ, Craig DW, Liang W, Venkata Y, Kurdoglu A, Aldrich J, Auclair D, Allen K, Harrison B, Jewell S, Kidd PG, Correll M, Jagannath S, Siegel DS, Vij R, Orloff G, Zimmerman TM, MMRF CoMMpass Network, Capone W, Carpten J, Lonial S.
"Molecular Predictors of Outcome and Drug Response in Multiple Myeloma: An Interim Analysis of the Mmrf CoMMpass Study" by Jonathan J Keats, PhD, Gil Speyer, Austin Christofferson, Christophe Legendre, PhD, Jessica Aldrich, Megan Russell, Lori Cuyugan, Jonathan Adkins, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, Ravi Vij, MD, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD, Robert M Rifkin, Norma C Gutierrez, MD PhD, Mmrf CoMMpass Network, Jennifer Yesil, MS, Mary Derome, MS, Seungchan Kim, PhD, Winnie Liang, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Daniel Auclair, PhD, Sagar Lonial, MD FACP
"Identification of Initiating Trunk Mutations and Distinct Molecular Subtypes: An Interim Analysis of the Mmrf Commpass Study" by Jonathan J Keats, PhD, Gil Speyer, Legendre Christophe, Christofferson Austin, Kristi Stephenson, BS, Ahmet Kurdoglu, Megan Russell, Aldrich Jessica, Cuyugan Lori, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, David Siegel, MD PhD, Ravi Vij, MD MBA, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD PhD, Robert M. Rifkin, Norma C Gutierrez, The MMRF CoMMpass Network, Jen Toups, Mary Derome, MS, Winnie Liang, PhD, Seunchan Kim, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Sagar Lonial, MD

See 5 usage examples →

NIH NCBI Sequence Read Archive (SRA) on AWS

bam, cram, fastq, genetic, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-rel...

Usage examples

See 5 usage examples →

Basic Local Alignment Search Tool (BLAST) Databases

bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, transcriptomics

A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).

Usage examples

BLAST+ Docker by NCBI BLAST
Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman
BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang
BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden

See 4 usage examples →

Encyclopedia of DNA Elements (ENCODE)

bioinformatics, biology, genetic, genomic, life sciences

The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a ...

Usage examples

See 4 usage examples →

Genome in a Bottle on AWS

genetic, genomic, life sciences, reference index, vcf

Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up-to-date GIAB release.

Usage examples

GA4GH Benchmarking Tools by GA4GH Benchmarking Team
The Genome in a Bottle Github Project by Genome In A Bottle Consortium
High-coverage, long-read sequencing of Han Chinese trio reference samples by Wang Y et al (2019)
Extensive sequencing of seven human genomes to characterize benchmark reference materials by Zook J et al (2016)

See 4 usage examples →

Refgenie reference genome assets

bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing

Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.

Usage examples

See 4 usage examples →

UK Biobank Linkage Disequilibrium Matrices

genetic, genome wide association study, genomic, life sciences, population genetics

Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.
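
To make the idea of an LD matrix concrete, here is a minimal sketch that computes pairwise correlations between variants from a toy genotype dosage matrix. It is an illustration only: the toy data, the function name `ld_matrix`, and the choice of the Pearson correlation r as the LD measure are assumptions, and the listing does not say how the published matrices were computed or whether they store r or r².

```python
import numpy as np


def ld_matrix(genotypes: np.ndarray) -> np.ndarray:
    """Return the variant-by-variant Pearson correlation matrix.

    genotypes: array shaped (n_individuals, n_variants) holding allele
    dosages (0, 1, 2 or fractional values for imputed genotypes).
    Columns with zero variance would yield NaN and should be filtered first.
    """
    if genotypes.ndim != 2:
        raise ValueError("expected a 2D array shaped (individuals, variants)")
    # rowvar=False tells NumPy that each column is one variable (variant).
    return np.corrcoef(genotypes, rowvar=False)


# Hypothetical toy data: 6 individuals, 3 variants.
# Variants 0 and 1 are identical, so they are in perfect LD.
toy_genotypes = np.array(
    [
        [0, 0, 2],
        [1, 1, 1],
        [2, 2, 0],
        [0, 0, 1],
        [1, 1, 2],
        [2, 2, 0],
    ],
    dtype=float,
)

ld = ld_matrix(toy_genotypes)
ld_r2 = ld ** 2  # squared correlation, another common LD summary
```

The result is a symmetric matrix with ones on the diagonal; entry (i, j) measures how strongly variants i and j co-vary across individuals.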

Usage examples

PolyFun Wiki by Omer Weissbrod
Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores by Weissbrod et al.
PolyFun and PolyPred software by Omer Weissbrod
Functionally informed fine-mapping and polygenic localization of complex trait heritability by Weissbrod et al.

See 4 usage examples →

UK Biobank Pan-Ancestry Summary Statistics

genetic, genome wide association study, genomic, life sciences, population genetics

A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.

Usage examples

Hail on AWS Quick Start by Amazon Web Services and PrivoIT
Hail by Hail Team
Pan-ancestry genetic analysis of the UK Biobank by Pan UKBB Team
Hail Tutorials by Hail Team

See 4 usage examples →

Allen Ivy Glioblastoma Atlas

biology, cancer, computer vision, gene expression, genetic, glioblastoma, Homo sapiens, image processing, imaging, life sciences, machine learning, neurobiology

This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.

Usage examples

See 3 usage examples →

Allen Mouse Brain Atlas

biology, gene expression, genetic, image processing, imaging, life sciences, Mus musculus, neurobiology, transcriptomics

The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across >20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an...

Usage examples

See 3 usage examples →

Beat Acute Myeloid Leukemia (AML) 1.0

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES

Beat AML 1.0 is a collaborative research program involving 11 academic medical centers that worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples, offering genomic, clinical, and drug response data. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...

Usage examples

Clinical resistance to crenolanib in acute myeloid leukemia due to diverse molecular mechanisms by Zhang H, Savage S, Schultz AR, Bottomly D, White L, Segerdell E, et al.
Functional Genomic Landscape of Acute Myeloid Leukemia by Jeffrey W. Tyner, Cristina E. Tognon, Dan Bottomly et al.
Genomic Data Commons by National Cancer Institute

See 3 usage examples →

QIIME 2 Tutorial Data

bioinformatics, biology, ecosystems, environmental, genetic, genomic, health, life sciences, metagenomics, microbiome

QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

Usage examples

See 3 usage examples →

The Human Microbiome Project

amino acid, fasta, fastq, genetic, genomic, life sciences, metagenomics, microbiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Usage examples

The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight & Jeffrey I. Gordon
New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K. Crossman, Richard Stahl, Peter Eipers, et al.
Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.

See 3 usage examples →

1000 Genomes

fastq, genetic, genomic, life sciences, whole genome sequencing

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.

Usage examples

Examine genomic variation across populations with AWS by Konstantinos Tzouvanas
Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR by Alyssa Marrow

See 2 usage examples →

4D Nucleome (4DN)

bioinformatics, biology, genetic, genomic, imaging, life sciences

The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...

Usage examples

See 2 usage examples →

Biological and Physical Sciences (BPS) Microscopy Benchmark Training Dataset

fluorescence imaging, GeneLab, genetic, genetic maps, life sciences, microscopy, NASA SMD AI

Fluorescence microscopy images of individual nuclei from mouse fibroblast cells, irradiated with Fe particles or X-rays with fluorescent foci indicating 53BP1 positivity, a marker of DNA damage. These are maximum intensity projections of 9-layer microscopy Z-stacks.
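
As an illustration of what a maximum intensity projection of a Z-stack means, the sketch below collapses a synthetic 9-layer stack into one 2D image by keeping the brightest value at each pixel. The array shape, the random data, and the function name `max_intensity_projection` are assumptions made for the example; the sketch does not describe how the dataset's images were actually produced.

```python
import numpy as np


def max_intensity_projection(z_stack: np.ndarray) -> np.ndarray:
    """Collapse a (layers, height, width) stack into a (height, width) image.

    Each output pixel is the maximum of that pixel across all layers.
    """
    if z_stack.ndim != 3:
        raise ValueError("expected a 3D array shaped (layers, height, width)")
    return z_stack.max(axis=0)


# Hypothetical synthetic stack: 9 focal planes of a 4x4 image.
rng = np.random.default_rng(0)
stack = rng.integers(0, 200, size=(9, 4, 4), dtype=np.uint8)

# Simulate a bright focus that is only in focus on layer 5.
stack[5, 2, 2] = 255

projection = max_intensity_projection(stack)
```

Because the projection keeps the brightest layer per pixel, a focus visible in only one focal plane still appears in the 2D result.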

Usage examples

Dose, LET and Strain Dependence of Radiation-Induced 53BP1 Foci in 15 Mouse Strains Ex Vivo Introducing Novel DNA Damage Metrics by Sébastien Penninckx, Egle Cekanaviciute, Charlotte Degorre, Elodie Guiet, Louise Viger, Stéphane Lucas, Sylvain V. Costes
NASA SMD AI Workshop Report by SMD Artificial Intelligence (AI) Initiative

See 2 usage examples →

Biological and Physical Sciences (BPS) RNA Sequencing Benchmark Training Dataset

gene expression, GeneLab, genetic, genetic maps, life sciences, NASA SMD AI, space biology

RNA sequencing data from spaceflown and control mouse liver samples, sourced from NASA GeneLab and augmented with a generative adversarial network.

Usage examples

NASA SMD AI Workshop Report by SMD Artificial Intelligence (AI) Initiative
Adversarial generation of gene expression data by Ramon Viñas, Helena Andrés-Terré, Pietro Liò, Kevin Bryson

See 2 usage examples →

Broad Genome References

bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index

Human genome reference builds hg19/hg38 and decoy references, maintained by the Broad Institute.

Usage examples

Using Amazon FSx for Lustre for Genomics Workflows on AWS by W. Lee Pang
Advancing NGS quality control to enable measurement of actionable mutations in circulating tumor DNA by Willey J. C., Morrison T. B., Austermiller B., Crawford E. E., et al (2021)

See 2 usage examples →

DNAStack COVID19 SRA Data

bam, bioinformatics, coronavirus, COVID-19, fasta, fastq, genetic, genomic, global, health, life sciences, long read sequencing, SARS-CoV-2, vcf, virus, whole genome sequencing

The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...

Usage examples

Viral lineage assignment by Heather Ward
Viral AI by DNAstack

See 2 usage examples →

GATK Structural Variation (SV) Data

bioinformatics, biology, cromwell, gatk-sv, genetic, genomic, life sciences, structural variation

This dataset holds the data needed to run a structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data in AWS.

Usage examples

Structural Variant Analysis on AWS with Amazon FSx for Lustre by Goldfinch Bio and Loka Inc.
AWS Setup & Execution by Goldfinch Bio and Loka Inc.

See 2 usage examples →

Hecatomb Databases

bioinformatics, genetic, genomic, life sciences, metagenomics, virus, whole genome sequencing

Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.

Usage examples

See 2 usage examples →

OpenCRAVAT

genetic, genomic, life sciences, sqlite, tertiary analysis, variant annotation

OpenCRAVAT is a modular variant annotation tool developed by KarchinLab at Johns Hopkins. This dataset is a mirror of the OpenCRAVAT store available at https://store.opencravat.org. You can configure OpenCRAVAT to use this mirror by editing the "cravat-system.yml" file. The path to this file is printed in the first output line of the command "oc config system". In that file, change the value of "store_url" to "https://opencravat-store-aws.s3.amazonaws.com".
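The "store_url" change described above can also be scripted. The following is a minimal sketch, not an official OpenCRAVAT utility: it assumes "store_url" appears as a top-level key on its own line of "cravat-system.yml", and the helper name set_store_url and the sample file contents are hypothetical. It works on the file's text only, so you would read the file from the path reported by "oc config system", pass its contents through the function, and write the result back.

```python
import re

# Mirror URL taken from the dataset description.
AWS_MIRROR = "https://opencravat-store-aws.s3.amazonaws.com"


def set_store_url(config_text: str, new_url: str = AWS_MIRROR) -> str:
    """Return config_text with the top-level store_url value replaced.

    Hypothetical helper: assumes store_url sits on its own unindented line.
    If no such line exists, one is appended at the end.
    """
    pattern = re.compile(r"^store_url\s*:.*$", flags=re.MULTILINE)
    new_line = f"store_url: {new_url}"
    if pattern.search(config_text):
        # A function replacement avoids any escape processing of the URL.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{new_line}\n"


# Illustrative contents only; a real cravat-system.yml holds other keys.
sample = "modules_dir: /tmp/modules\nstore_url: https://store.opencravat.org\n"
updated = set_store_url(sample)
print(updated)
```

Keeping the edit as a pure text transformation leaves every other line of the file untouched, which a YAML load and dump round trip would not guarantee.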

Usage examples

Changing the OpenCRAVAT store url by Kyle Moad
OpenCRAVAT by Karchinlab

See 2 usage examples →

Pancreatic Cancer Organoid Profiling

cancer, genetic, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS), and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.

Usage examples

Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer by Tiriac H, Belleau P, Engle DD, Plenker D, Deschênes A, Somerville TD, et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

Reference data for HiFi human WGS

genetic, health, Homo sapiens, life sciences, long read sequencing, mapping, variant annotation, vcf, whole genome sequencing

Reference data bundle for analyzing HiFi human whole genome sequencing data.

Usage examples

See 2 usage examples →

Tabula Sapiens

biology, encyclopedic, genetic, genomic, health, life sciences, medicine, single-cell transcriptomics

Tabula Sapiens is a benchmark, first-draft human cell atlas of over 1.1M cells from 28 organs of 24 normal human subjects. This work is the product of the Tabula Sapiens Consortium. Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects, and allows detailed analysis and comparison of cell types that are shared between tissues. Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments. We...

Usage examples

Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues by The Tabula Sapiens Consortium, Stephen R Quake
The Tabula Sapiens: a multiple organ single cell transcriptomic atlas of humans by The Tabula Sapiens Consortium

See 2 usage examples →

COVID-19 Genome Sequence Dataset

bam, bioinformatics, biology, coronavirus, COVID-19, cram, fastq, genetic, genomic, health, life sciences, MERS, SARS, STRIDES, transcriptomics, virus, whole genome sequencing

This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six ho...

Usage examples

Download SRA sequence data using Amazon Web Services (AWS) by NCBI SRA

See 1 usage example →

Human PanGenomics Project

cram, fast5, fastq, genetic, genomic, life sciences

This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.

Usage examples

Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes by Shafin et al (2020)

See 1 usage example →

Knowledge Portal Network Bottom-line Genetic Associations

genetic, genome wide association study, life sciences

At the Knowledge Portal Network, we aggregate and analyze genetic association results for a wide range of diseases and traits. For any given disease, a large number of individual genetic association datasets may have been generated. To make these results more interpretable, we meta-analyze all datasets for each phenotype, using a method that we term "bottom-line integrative analysis". Here we provide the bottom-line summary statistic files for public download.

Usage examples

The Type 2 Diabetes Knowledge Portal: An open access genetic resource dedicated to type 2 diabetes and related traits by Costanzo MC, von Grotthuss M, Massung J, Jang D, Caulkins L, Koesterer R, et al.
Tutorial: Use cases for the Knowledge Portal Network bottom-line genetic associations by Jason Flannick
Leveraging type 1 diabetes human genetic and genomic data in the T1D knowledge portal by Kudtarkar P, Costanzo MC, Sun Y, Jang D, Koesterer R, Mychaleckyj JC, et al.
Cardiovascular Disease Knowledge Portal: A Community Resource for Cardiovascular Disease Research by Costanzo MC, Roselli C, Brandes M, Duby M, Hoang Q, Jang D, et al.

See 4 usage examples →

iHART Whole Genome Sequencing Data Set

autism spectrum disorder, bam, genetic, genomic, life sciences, vcf, whole genome sequencing

iHART is the Hartwell Foundation’s Autism Research and Technology Initiative. This release contains whole genome data from over 1000 families with 2 or more children with autism, whose biomaterials were provided by the Autism Genetic Resource Exchange (AGRE).

Usage examples

Inherited and De Novo Genetic Risk for Autism Impacts Shared Networks by Ruzzo et al. (2020)

See 1 usage example →

recount3

bioinformatics, biology, cancer, csv, gene expression, genetic, genomic, Homo sapiens, life sciences, Mus musculus, neuroscience, transcriptomics

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse, respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.

Usage examples

recount3 quick start guide by Leonardo Collado-Torres

See 1 usage example →

Australasian Genomes

biodiversity, biology, conservation, genetic, genomic, life sciences, transcriptomics, wildlife

Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI) and the ARC Centre for Innovations in Peptide and Protein Science (CIPPS). This repository contains reference genomes, transcriptomes, resequenced genomes and reduced representation sequencing data from Australasian species. Australasian Genomes is managed by the Australasian Wildlife Genomics Group (AWGG) at the University of Sydney on behalf of our collaborators within TSI and CIPPS.

GATK Test Data

bioinformatics, biology, cancer, genetic, genomic, life sciences

The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute's Genome Analysis Toolkit (GATK).

GX database for NCBI Foreign Contamination Screen (FCS) Tool Suite

assembly, bioinformatics, biology, contamination, fasta, genetic, genome, health, life sciences, whole genome sequencing

Sequence database used by FCS-GX (Foreign Contamination Screen - Genome Cross-species aligner) to detect contamination from foreign organisms in genome sequences.

Genome Ark

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.

Nanopore Reference Human Genome

genetic, genomic, life sciences, whole genome sequencing

This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.

OpenWings OpenData

biodiversity, fastq, genetic, genome, life sciences, museum, wildlife

DNA sequence data of UCE loci collected from the world's bird species (n=10,560).

Usage examples

Tutorial I - UCE Phylogenomics by Brant Faircloth
Ultraconserved elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales by BC Faircloth, JE McCormack, NG Crawford, MG Harvey, RT Brumfield, TC Glenn
phyluce by Brant Faircloth

See 3 usage examples →

The Genome Modeling System

genetic, genomic, life sciences

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

UCSC Genome Browser Sequence and Annotations

bioinformatics, biology, genetic, genomic, life sciences

The UCSC Genome Browser is an online graphical genome browser hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-separated format and all binary files in custom formats, sometimes referred to as 'gbdb' files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome...

University of British Columbia Sunflower Genome Dataset

agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing

This dataset captures sunflower genetic diversity originating from thousands of wild, cultivated, and landrace sunflower individuals distributed across North America. The data consists of raw sequences and associated botanical metadata, aligned sequences (to three different reference genomes), and sets of SNPs computed across several cohorts.

1KG-ONT-VIENNA panel

fast5, fastq, genetic, genomic, life sciences, whole genome sequencing

The 1KG-ONT-VIENNA panel comprises medium-coverage ONT sequencing data for 1,019 samples from the 1000 Genomes Project collection, along with structural variants and their haplotype context.

Usage examples

Long-read sequencing and structural variant characterization in 1,019 samples from the 1000 Genomes Project by Siegfried Schloissnig, Samarendra Pani, Bernardo Rodriguez-Martin, Jana Ebler, Carsten Hain, Vasiliki Tsapalou, Arda Söylev, Patrick Hüther, Hufsah Ashraf, Timofey Prodanov, Mila Asparuhova, Sarah Hunt, Tobias Rausch, Tobias Marschall, Jan O Korbel

See 1 usage example →

AWS iGenomes

agriculture, amazon.science, biology, Caenorhabditis elegans, Danio rerio, genetic, genomic, Homo sapiens, life sciences, Mus musculus, Rattus norvegicus, reference index

Common reference genomes hosted on AWS S3. They can be used when aligning and analysing raw DNA sequencing data.

Usage examples

nf-core analysis pipelines by Phil Ewels

See 1 usage example →

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

amazon.science, bioinformatics, fastq, genetic, genomic, life sciences, long read sequencing, short read sequencing, whole exome sequencing, whole genome sequencing

To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets across different sequencing instruments, sequencing modalities (Illumina short read and Pacific Biosciences long read), and sample preparation protocols, for both whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.

Usage examples

An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development by Baid G., Nattestad M., Kolesnikov A., Goel S., Yang H., Chang P., and Carroll A (2020)

See 1 usage example →

OceanOmics

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

Minderoo Foundation OceanOmics aims to establish environmental DNA (eDNA) as a tool to measure, understand, and protect oceans. OceanOmics mainly generates two types of data: eDNA sequencing data (metabarcoding, metagenomics), and genome assembly data (marine vertebrates).

Usage examples

Case-studies on using OceanOmics genomes and eDNA data by Philipp Bayer

See 1 usage example →