About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with genomic.

You are currently viewing a subset of data tagged with genomic.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

The Cancer Genome Atlas

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Usage examples

Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context by Hua-Sheng Chiu, Sonal Somvanshi, et al.
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers by Xinxin Peng, Zhongyuan Chen, et al.
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics by Jianfang Liu, Tara Lichtenberg, et al.
Spatial Organization And Molecular Correlation Of Tumor-Infiltrating Lymphocytes Using Deep Learning On Pathology Images by Joel Saltz, Rajarsi Gupta, et al.
GDC Legacy Archive by National Cancer Institute

See 29 usage examples →

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...

Usage examples

Biomarker significance of plasma and tumor miR-21, miR-221, and miR-106a in osteosarcoma by Nakka M, Allen-Rhoades W, Li Y, et al.
The genetic landscape of high-risk neuroblastoma by Pugh TJ, Morozova O, Attiyeh EF, Asgharzadeh S, Wei JS, Auclair D, Carter SL, Cibulskis K, Hanna M, Kiezun A, Kim J, Lawrence MS, Lichenstein L, et al.
The molecular landscape of pediatric acute myeloid leukemia reveals recurrent structural alterations and age-specific mutational interactions by Bolouri H, Farrar JE, Triche T Jr, et al.
Relapsed neuroblastomas show frequent RAS-MAPK pathway mutations by Eleveld TF, Oldridge DA, Bernard V, Koster J, et al.
ISB Cancer Genomics Cloud by Institute for Systems Biology

See 24 usage examples →

Gabriella Miller Kids First Pediatric Research Program (Kids First)

cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing

The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...

Usage examples

Kids First DRC Portal by Kids First DRC
Genome-wide Enrichment of De Novo Coding Mutations in Orofacial Cleft Trios. by Madison R Bishop, Kimberly K Diaz Perez, et al.
Elucidation of de novo small insertion/deletion biology with parent-of-origin phasing. by Allison H Seiden, Felix Richter, et al.
Kids First DRC Source Code by Kids First DRC
Whole genome sequencing of orofacial cleft trios from the Gabriella Miller Kids First Pediatric Research Consortium identifies a new locus on chromosome 21. by Nandita Mukhopadhyay, Madison Bishop, et al.

See 19 usage examples →

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, 4.2, and 4.4

bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing

Description

Overview

This dataset contains alignment files and small variant (includes single nucleotide variants (SNV) and indels), copy number variant (CNV), short tandem repeat (i.e., repeat expansion; STR), structural variant (SV) and other variant call files from the 1000 Genomes Project (1KGP) Phase 3 dataset (3,202 individuals, 602 trios) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, v4.2.7, and v4.4.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information). The v3.7.6, v4.2.7, and v4.4.7 datasets include results from trio small variant, de novo structural variant, and de novo copy number variant calls on 602 trio families comprised of members from the 1KGP Phase 3 dataset. Trio repeat expansion calling was included in the v3.7.6 dataset only. Joint cohort analysis was also performed on the entire 1KGP sample dataset for the v3.7.6, v4.0.3, v4.2.7, and v4.4.7 re-analyses using DRAGEN Iterative gVCF Genotyper v3.8.3, v4.2.0, v4.2.7, v4.4.7, respectively (see 'Genotyping variants at population scale using DRAGEN gVCF Genotyper' and 'Population Genotyping').

DRAGEN Versions

v3.7

User Guide | Release Notes

Improvements and new features in the v3.7.6 individual samples analyses include CYP2D6 variant calling (see 'Overcoming high homology to detect variation in CYP21A2 with whole-genome sequencing in DRAGEN') and joint detection and use of graph-based hg19 and hg38 reference hash tables (see 'DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes' and 'Demystifying the versions of GRCh38/hg38 reference genomes, how they are used in DRAGEN and their impact on accuracy' for details).

v4.0

User Guide | Release Notes

The DRAGEN v4.0.3 dataset features improved small variant calling accuracy, achieved through newly integrated machine learning functionality and an updated graph-based reference for difficult-to-map regions (see 'DRAGEN Sets New Standard for Data Accuracy in PrecisionFDA Benchmark Data. Optimizing Variant Calling Performance with Illumina Machine Learning and DRAGEN Graph'); accuracy and runtime improvements in the SV caller; new targeted callers including CYP2B6, GBA, SMN and a Star Allele PGx caller; and an expanded catalog for use with the ExpansionHunter STR caller.

v4.2

User Guide | Release Notes

DRAGEN v4.2.7 offers significant accuracy improvements in small variant, CNV, and SV calling, includes new targeted callers (HBA, LPA, RH, CYP21A2, SMN silent carrier variant), and supports Star Allele calling for five additional pharmacogenes (BCHE, ABCG2, NAT2, F5, and UGT2B17). These are further improved by upgraded machine learning models. See 'DRAGEN 4.2: Enhanced machine learning, new targeted callers, and more' for further details on these and other enhancements.

v4.4

User Guide | Release Notes

DRAGEN v4.4.7 boosts the speed and accuracy of all callers via the official release of an optimized pangenome graph reference (see 'The quest for accuracy gains in the dark regions of the genomes: Presenting the DRAGEN multigenome mapper and pangenome reference updates in version 4.3'). In particular, SV calling accuracy is substantially increased by a multigenome mapper that exploits the pangenome reference. Runtime is further reduced by support for AWS F2 EC2 instances (see 'Enabling Rapid Genomic and Multiomic Data Analysis with Illumina DRAGEN™ v4.4 on Amazon EC2 F2 Instances').

Annotation

Starting with the v4.0.3 reanalysis, annotation using the Illumina Connected Annotations (also known as Illumina Annotation Engine or Nirvana) was included as part of the analysis (see Illumina Connected Annotations documentation ...

Usage examples

Nirvana Documentation by Illumina Inc.
Unveiling Illumina Connected Annotations: A breakthrough in genomic annotation by Illumina Inc. (2024)
DRAGEN Bio-IT Platform by Illumina Inc.
DRAGEN Sets New Standard for Data Accuracy in PrecisionFDA Benchmark Data. Optimizing Variant Calling Performance with Illumina Machine Learning and DRAGEN Graph by Illumina Inc. (2022)
End-to-End User Flow: DRAGEN Analysis by Illumina Inc.

See 17 usage examples →

Genome Aggregation Database (gnomAD)

bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v4.1 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries, sequenced as part of various disease-specific and population genetic studies. The gnomAD Principal Investigators and team can be found here, and the groups that have contributed data to the current release are listed here. Sign up for the gnom...

Usage examples

gnomAD v3.0 by Laurent Francioli, Daniel MacArthur
gnomAD quality control GitHub repository by gnomAD Production Team
The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020) by Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., Gauthier, L. D., Brand, H., Solomonson, M., Watts, N. A., Rhodes, D., Singer-Berk, M., England, E. M., Seaby, E. G., Kosmicki, J. A., ... MacArthur, D. G.
Characterising the loss-of-function impact of 5’ untranslated region variants in 15,708 individuals. Nature Communications 11, 2523 (2020) by Whiffin, N., Karczewski, K. J., Zhang, X., Chothani, S., Smith, M. J., Gareth Evans, D., Roberts, A. M., Quaife, N. M., Schafer, S., Rackham, O., Alföldi, J., O’Donnell-Luria, A. H., Francioli, L. C., Genome Aggregation Database (gnomAD) Production Team, Genome Aggregation Database (gnomAD) Consortium, Cook, S. A., Barton, P. J. R., MacArthur, D. G., & Ware, J. S.
A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024) by Chen, S., Francioli, L. C., Goodrich, J. K., Collins, R. L., Wang, Q., Alföldi, J., Watts, N. A., Vittal, C., Gauthier, L. D., Poterba, T., Wilson, M. W., Tarasova, Y., Phu, W., Yohannes, M. T., Koenig, Z., Farjoun, Y., Banks, E., Donnelly, S., Gabriel, S., Gupta, N., Ferriera, S., Tolonen, C., Novod, S., Bergelson, L., Roazen, D., Ruano-Rubio, V., Covarrubias, M., Llanwarne, C., Petrillo, N., Wade, G., Jeandet, T., Munshi, R., Tibbetts, K., gnomAD Project Consortium, O’Donnell-Luria, A., Solomonson, M., Seed, C., Martin, A. R., Talkowski, M. E., Rehm, H. L., Daly, M. J., Tiao, G., Neale, B. M., MacArthur, D. G. & Karczewski, K. J.

See 15 usage examples →

The Singapore Nanopore Expression Data Set

bam, bioinformatics, fast5, fasta, fastq, genomic, life sciences, long read sequencing, short read sequencing, transcriptomics

The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...

Usage examples

nf-core/nanoseq: A nanopore DNA and RNA-Seq demultiplexing, QC, alignment and analysis pipeline by Chelsea Sawyer et al.
JAFFAL: Detecting fusion genes with long read transcriptome sequencing. by Nadia M Davidson et al.
Performing transcript discovery and quantification with Bambu by Min Hao Ling
Detection of m6A from direct RNA sequencing using a Multiple Instance Learning framework. by Christopher Hendra et al.
JAFFAL: Detection of fusion genes from long read RNA-Seq data by Nadia M Davidson et al.

See 15 usage examples →

RADIANT Public Data

cancer, genetic, genomic, Homo sapiens, life sciences, medical imaging, pediatric, radiology, transcriptomics, whole genome sequencing

The Real-time Analysis and Discovery in Integrated And Networked Technologies (RADIANT) initiative seeks to develop an extensible, federated framework for rapid exchange of multimodal clinical and research data on behalf of accelerated discovery and patient impact. Coordination and implementation of initial RADIANT deployments will leverage a network of more than 35 partnered health care systems and participating patient families within the Children’s Brain Tumor Network (CBTN) and the Pediatric Neuro-Oncology Consortium (PNOC). This data set is composed of public multi-modal data provisio...

Usage examples

Generation and multi-dimensional profiling of a childhood cancer cell line atlas defines new therapeutic opportunities by Claire Xin Sun, Paul Daniel, Gabrielle Bradshaw et al.
Flywheel (CHOP D3b) by Flywheel
Multi-scale signaling and tumor evolution in high-grade gliomas by Jingxian Liu, Song Cao, Kathleen J Imback, et al.
Germline analysis of an international cohort of pediatric diffuse midline glioma patients by Marion K Mateos, Pamela Ajuyah, Noemi Fuentes-Bolanos, et al.
A road map for the treatment of pediatric diffuse midline glioma by Carl Koschmann, Wajd N Al-Holou, Marta M Alonso, et al.

See 13 usage examples →

Open Targets

bioinformatics, biology, drug discovery, genetic, genomic, life sciences, protein

The Open Targets Platform is a comprehensive data integration tool that supports systematic identification and prioritisation of potential therapeutic drug targets. By integrating publicly available datasets including data generated by the Open Targets experimental and informatics research programmes, the Platform provides data and services to assist in the task of therapeutic hypothesis building.

Usage examples

See 11 usage examples →

The Cancer Dependency Map (DepMap) Cancer Cell Line Encyclopedia (CCLE) Dataset

bam, bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, short read sequencing, transcriptomics, whole exome sequencing, whole genome sequencing

This dataset consists of whole genome sequencing (WGS), whole exome sequencing (WES), and RNA sequencing files generated from ~1000 cancer cell lines described in Ghandi et al., 2019.

Usage examples

Cancer Cell Line Encyclopedia (CCLE) by Ghandi, Huang, Jané-Valbuena et al.
Integrated cross-study datasets of genetic dependencies in cancer by Pacini, Dempster, Boyle et al.
DepMap Omics CCLE data on the AWS Open Data Registry by Devin McCabe
The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks by Ben Guebila, Wang, Lopes-Ramos et al.
The present and future of the Cancer Dependency Map by Arafeh, Shibue, Dempster et al.

See 11 usage examples →

Alliance of Genome Resources

bioinformatics, biology, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, fasta, gene expression, genetic, genome, genomic, Homo sapiens, life sciences, Mus musculus, protein, Rattus norvegicus, transcriptomics, vcf

The Alliance of Genome Resources is a consortium that integrates genomic, genetic, and molecular data from leading model organism databases including Drosophila melanogaster, Caenorhabditis elegans, Danio rerio (zebrafish), Mus musculus (mouse), Rattus norvegicus (rat), Saccharomyces cerevisiae (yeast), Xenopus laevis and Xenopus tropicalis (frogs), and human reference data. The Alliance provides comprehensive datasets including gene annotations, disease associations, expression data (bulk and single-cell RNA-Seq), protein and genetic interactions, orthology relationships, variants and alleles...

Usage examples

See 10 usage examples →

Garvan Institute Long Read Sequencing Benchmark Data

bioinformatics, genomic, life sciences, long read sequencing

The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Examples include: NA12878 (HG001) and NA24385 (HG002) sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells; and, UHR RNA (direct-RNA) on an ONT PromethION using the latest RNA004 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for ...

Usage examples

Directly processing on an s3fs mount by Hasindu Gamaarachchi
Slow5tools: toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format. by Samarakoon, H., Ferguson, J.M., Jenner, S.P. et al.
Slow5curl: library and tool for accessing remote BLOW5 files. by Wong, B., et al.
Flexible and efficient handling of nanopore sequencing signal data with slow5tools. by Samarakoon, H., Ferguson, J.M., Jenner, S.P. et al.
Slow5lib: toolkit slow5lib is a software library for reading & writing SLOW5 files. by Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al.

See 10 usage examples →

NHGRI AnVIL Project

biology, gene expression, genome, genomic, Homo sapiens, life sciences

The NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL) Project (https://anvilproject.org/) is the National Human Genome Research Institute's cloud-based platform for genomic data sharing and analysis. AnVIL hosts widely used human genome reference datasets generated through NHGRI-funded research. AnVIL on Open Data on AWS provides public access to open-access datasets available through AnVIL. The project is a collaborative effort involving NHGRI, the Broad Institute, Johns Hopkins University, the University of California Santa Cruz, Vanderbilt University Medical Center, Brigh...

Usage examples

Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References by Dylan J. Taylor, Jordan M. Eizenga, Qiuhui Li, Arun Das, Katharine M. Jenike, Eimear E. Kenny, Karen H. Miga, Jean Monlong, Rajiv C. McCoy, Benedict Paten, and Michael C. Schatz
CNPI: Rapid Analyses of Human Copy Number Data by Jack Ustanik, Tychele N. Turner
A complete reference genome improves analysis of human genetic variation by Sergey Aganezov, Stephanie M. Yan, Daniela C. Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J. Taylor, Kishwar Shafin, Alaina Shumate, Chunlin Xiao, Justin Wagner, Jennifer McDaniel, Nathan D. Olson, Michael E. G. Sauria, Mitchell R. Vollger, Arang Rhie, Melissa Meredith, Skylar Martin, Joyce Lee, Sergey Koren, Jeffrey A. Rosenfeld, Benedict Paten, Ryan Layer, Chen-Shan Chin, Fritz J. Sedlazeck, Nancy F. Hansen, Danny E. Miller, Adam M. Phillippy, Karen H. Miga, Rajiv C. McCoy, Megan Y. Dennis, Justin M. Zook, Michael C. Schatz
Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing by Sam Kovaka, Shujun Ou, Katharine M. Jenike, Michael C. Schatz
Deciphering the impact of genomic variation on function by IGVF Consortium

See 13 usage examples →

PubSeq - Public Sequence Resource

bam, bioinformatics, biology, coronavirus, COVID-19, fast5, fasta, fastq, genetic, genomic, health, json, life sciences, long read sequencing, medicine, MERS, metadata, open source software, RDF, SARS, SARS-CoV-2, SPARQL

COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.

Usage examples

See 9 usage examples →

Cancer Cell Line Encyclopedia (CCLE)

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES, transcriptomics, whole genome sequencing

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.

Usage examples

ISB CGC BigQuery tables by Institute for Systems Biology
Genomic Data Commons by National Cancer Institute
Pharmacogenomic agreement between two cancer cell line data sets by The Cancer Cell Line Encyclopedia Consortium & The Genomics of Drug Sensitivity in Cancer Consortium
Next-generation characterization of the Cancer Cell Line Encyclopedia by Ghandi, M., Huang F. et al.
Cancer Genomics Cloud by Seven Bridges

See 8 usage examples →

Logan Unitigs and Contigs of the Sequence Read Archive (SRA) on AWS

fasta, genetic, genomic, life sciences, metagenomics, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve near...

Usage examples

Logan - Planetary-Scale Genome Assembly Surveys Life’s Diversity by Chikhi R., Lemane T., Loll-Krippleber R., et al. (2025)
Open Virome by Artem Babaian
Search for sequences inside unitigs or contigs by Rayan Chikhi
Logan Search by Pierre Peterlongo
f2sz by Anton Korobeynikov

See 8 usage examples →

NIH Roadmap Epigenomics

bioinformatics, biology, epigenomics, genetic, genomic, life sciences

The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The project has generated high-quality, genome-wide maps of several key histone modifications, chromatin accessibility, DNA methylation and mRNA expression across hundreds of human cell types and tissues. To see what data is available, please check the directory listing: https://roadmapepigenomics.s3.us-west-2.amazonaws.com/index.html.
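As a rough illustration of browsing that bucket programmatically rather than through the HTML listing, the sketch below builds a standard S3 ListObjectsV2 request and parses the XML reply. The bucket name and region come from the URL above; the helper names (`listing_url`, `parse_listing`, `list_objects`) are hypothetical, and the sketch assumes the bucket permits anonymous listing, which is not confirmed here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket name and region are taken from the directory-listing URL above.
BUCKET_URL = "https://roadmapepigenomics.s3.us-west-2.amazonaws.com/"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(prefix="", max_keys=20):
    """Build an S3 ListObjectsV2 request URL for the bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(prefix="", max_keys=20):
    """Fetch one page of the listing (assumes anonymous listing is allowed)."""
    with urllib.request.urlopen(listing_url(prefix, max_keys)) as response:
        return parse_listing(response.read().decode("utf-8"))
```

Calling `list_objects()` would return the first page of keys with their sizes, if the assumption about anonymous listing holds. Only the standard library is used, so no AWS credentials or SDK are needed for this sketch.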

Usage examples

WashU Epigenome Browser update 2019 by Daofeng Li, Silas Hsu, Deepak Purushotham, Renee L Sears and Ting Wang
Navigation of Roadmap data using Roadmap web portal by Anshul Kundaje Lab
Human body epigenome maps reveal noncanonical DNA methylation variation by Matthew D. Schultz, Yupeng He, John W. Whitaker, Manoj Hariharan, Eran A. Mukamel, Danny Leung, Nisha Rajagopal, Joseph R. Nery, Mark A. Urich, Huaming Chen, Shin Lin, Yiing Lin, Inkyung Jung, Anthony D. Schmitt, Siddarth Selvaraj, Bing Ren, Terrence J. Sejnowski, Wei Wang & Joseph R. Ecker
Integrative analysis of 111 reference human epigenomes by Roadmap Epigenomics Consortium, Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour et al., Ting Wang, Manolis Kellis
Visualize Roadmap data with WashU Epigenome Browser by WashU Epigenome Browser

See 8 usage examples →

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

bioinformatics, biology, environmental, epigenomics, genetic, genomic, life sciences

The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.

Usage examples

Metabolic effects of air pollution exposure and reversibility by Rajagopalan S, Park B, Palanivel R, et al.
Environmental Determinants of cardiovascular disease: lessons learned from air pollution by Al-Kindi SG, Brook RD, Biswal S, Rajagopalan S.
Epigenetic biomarkers and preterm birth by Park B, Khanam R, Vinayachandran V, et al.
The role of environmental exposures and the epigenome in health and disease. by Perera BPU, Faulk C, Svoboda LK, Goodrich JM, Dolinoy DC.
Comparison of differential accessibility analysis strategies for ATAC-seq data by Gontarz P, Fu S, Xing X, Liu S, Miao B et al.

See 8 usage examples →

CIViC (Clinical Interpretation of Variants in Cancer)

cancer, genetic, genomic, life sciences, vcf

Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...

Usage examples

See 7 usage examples →

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is the Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.

Usage examples

Genomic Data Commons by National Cancer Institute
Cancer Genomics Cloud by Seven Bridges
Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities by Suhas Vasaikar, Chen Huang, Xiaojing Wang, Vladislav A. Petyuk, Sara R. Savage, Bo Wen, Yongchao Dou, Yun Zhang, Zhiao Shi, Osama A. Arshad, Marina A. Gritsenko, Lisa J. Zimmerman, Jason E. McDermott, Therese R. Clauss, Ronald J. Moore, Rui Zhao, Matthew E. Monroe, Yi-Ting Wang, Matthew C. Chambers, Robbert J.C. Slebos, Ken S. Lau, Qianxing Mo, Li Ding, Matthew Ellis, Mathangi Thiagarajan, Christopher R. Kinsinger, Henry Rodriguez, Richard D. Smith, Karin D. Rodland, Daniel C. Liebler, Tao Liu, Bing Zhang, Clinical Proteomic Tumor Analysis Consortium
Proteomic analysis of colon and rectal carcinoma using standard and customized databases by Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC
CPTAC Data Portal by National Cancer Institute

See 7 usage examples →

ICGC on AWS

bam, cancer, genetic, genomic, life sciences, vcf

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Usage examples

See 7 usage examples →

SnpEff & SnpSift Genomic Variant Annotation Databases

bioinformatics, cancer, genetic, genome, genomic, life sciences, protein, structural variation, transcriptomics, variant annotation, vcf, whole exome sequencing, whole genome sequencing

SnpEff is a variant annotation and effect prediction tool that annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes). It supports over 38,000 genomes and provides comprehensive genomic databases for variant annotation. The databases include reference genomes, gene annotations, protein sequences, and regulatory elements from trusted sources like ENSEMBL, RefSeq, and UCSC. SnpSift complements SnpEff by providing tools to annotate genomic variants using databases, filter large genomic datasets, and manipulate annotated variants. Together, these ...

Usage examples

See 7 usage examples →

Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is the Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.

Usage examples

Cancer Genomics Cloud by Seven Bridges
Evaluation of NCI-7 Cell Line Panel as a Reference Material for Clinical Proteomics by Clark DJ, Hu Y, Bocik W, Chen L, Schnaubelt M, Roberts R, Shah P, Whiteley G, Zhang H
Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma by Clark DJ, Dhanasekaran SM, Petralia F, Pan J, Song X, Hu Y, da Veiga Leprevost F, Reva B, Lih TM, Chang HY, Ma W, Huang C, Ricketts CJ, Chen L, Krek A, Li Y, Rykunov D, Li QK, Chen LS, Ozbek U, Vasaikar S, Wu Y, Yoo S, Chowdhury S, Wyczalkowski MA, Ji J, Schnaubelt M, Kong A, Sethuraman S, Avtonomov DM, Ao M, Colaprico A, Cao S, Cho KC, Kalayci S, Ma S, Liu W, Ruggles K, Calinawan A, Gümüş ZH, Geizler D, Kawaler E, Teo GC, Wen B, Zhang Y, Keegan S, Li K, Chen F, Edwards N, Pierorazio PM, Chen XS, Pavlovich CP, Hakimi AA, Brominski G, Hsieh JJ, Antczak A, Omelchenko T, Lubinski J, Wiznerowicz M, Linehan WM, Kinsinger CR, Thiagarajan M, Boja ES, Mesri M, Hiltke T, Robles AI, Rodriguez H, Qian J, Fenyö D, Zhang B, Ding L, Schadt E, Chinnaiyan AM, Zhang Z, Omenn GS, Cieslik M, Chan DW, Nesvizhskii AI, Wang P, Zhang H; Clinical Proteomic Tumor Analysis Consortium
Proteomic Data Commons by National Cancer Institute
CPTAC Data Portal by National Cancer Institute

See 6 usage examples →

Open Bioinformatics Reference Data for Galaxy

bioinformatics, biology, genetic, genomic, life sciences, reference index

This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make it easy to incorporate this data into a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...

Usage examples

Galaxy by Galaxy Project
Accessible, curated metagenomic data through ExperimentHub by Edoardo Pasolli, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd, Curtis Huttenhower, Martin Morgan, Nicola Segata, and Levi Waldron
Using Open Bio Ref Data with Galaxy and Bioconductor by Enis Afgan, Alexandru Mahmoud, Nuwan Goonasekera
Wrangling Galaxy's reference data by Daniel Blankenberg, James E. Johnson, The Galaxy Team, James Taylor, Anton Nekrutenko
TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages by Tiago C. Silva, Antonio Colaprico, Catharina Olsen, Fulvio D'Angelo, Gianluca Bontempi, Michele Ceccarelli, Houtan Noushmehr

See 6 usage examples →

Serratus: Ultra-deep Search for Novel Viruses - Versioned Data Release

bam, COVID-19, genetic, genomic, life sciences, MERS, SARS, SARS-CoV-2, virus

Serratus is a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses in response to the COVID-19 pandemic through re-analysis of publicly available genomic data. Our resulting vertebrate viral alignment data is explorable via the Serratus Explorer and directly accessible on Amazon S3.

Usage examples

coronaSPAdes. From biosynthetic gene clusters to RNA viral assemblies by Meleshko D., Hajirasouliha I., and Korobeynikov A. (2021)
Diversification of mammalian deltaviruses by host shifting by Bergner L.M., Orton R.J., et al (2021)
Ribovirus classification by a polymerase barcode sequence by Babaian A., and Edgar R. (2021)
Serratus Explorer by Serratus Team
Tantalus: An R Package for exploration of Serratus data by Serratus Team

See 6 usage examples →

3000 Rice Genomes Project

agriculture, food security, genetic, genomic, life sciences

The 3000 Rice Genomes Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.

Usage examples

Structural variants in 3000 rice genomes by Fuentes RR et al (2019)
Rice Galaxy: an open resource for plant science by Juanillas V et al (2019)
RiceGalaxy by International Rice Research Institute
Identification and Allele Combination Analysis of Rice Grain Shape-Related Genes by Genome-Wide Association Study by Meng B et al (2022)
Tracking the origin of two genetic components associated with transposable element bursts in domesticated rice by Chen J et al (2019)

See 5 usage examples →

Caenorhabditis Natural Diversity Resource

bam, bioinformatics, biology, Caenorhabditis elegans, fastq, gatk-sv, genetic maps, genome, genome wide association study, genomic, life sciences, short read sequencing, variant annotation, vcf

The Caenorhabditis Natural Diversity Resource (CaeNDR) is a data repository and analysis hub for wild strains of the selfing Caenorhabditis species C. elegans, C. briggsae, and C. tropicalis from around the world. It facilitates the discovery of genetic variation across all three species through genome-wide association mappings, which correlate genotype with phenotype and identify the genetic variation underlying quantitative traits.

Usage examples

FAQ - AWS API by Erik Andersen
Data Releases - C. tropicalis by Erik Andersen
Data Releases - C. elegans by Erik Andersen
CaeNDR, the Caenorhabditis Natural Diversity Resource by Crombie TA, McKeown R, Moya ND, Evans KS, Widmayer SJ, LaGrassa V, et al.
Data Releases - C. briggsae by Erik Andersen

See 5 usage examples →

CoMMpass from the Multiple Myeloma Research Foundation

cancer, genetic, genomic, life sciences, STRIDES, whole genome sequencing

The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...

Usage examples

"Interim Analysis Of The MMRF CoMMpass Trial: a Longitudinal Study In Multiple Myeloma Relating Clinical Outcomes To Genomic and Immunophenotypic Profiles" by Keats JJ, Craig DW, Liang W, Venkata Y, Kurdoglu A, Aldrich J, Auclair D, Allen K, Harrison B, Jewell S, Kidd PG, Correll M, Jagannath S, Siegel DS, Vij R, Orloff G, Zimmerman TM, MMRF CoMMpass Network, Capone W, Carpten J, Lonial S.
"Identification of Initiating Trunk Mutations and Distinct Molecular Subtypes: An Interim Analysis of the Mmrf Commpass Study" by Jonathan J Keats, PhD, Gil Speyer, Legendre Christophe, Christofferson Austin, Kristi Stephenson, BS, Ahmet Kurdoglu, Megan Russell, Aldrich Jessica, Cuyugan Lori, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, David Siegel, MD PhD, Ravi Vij, MD MBA, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD PhD, Robert M. Rifkin, Norma C Gutierrez, The MMRF CoMMpass Network, Jen Toups, Mary Derome, MS, Winnie Liang, PhD, Seungchan Kim, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Sagar Lonial, MD
"Interim Analysis of the Mmrf Commpass Trial: Identification of Novel Rearrangements Potentially Associated with Disease Initiation and Progression" by Sagar Lonial, MD, Venkata D Yellapantula, Winnie Liang, PhD, Ahmet Kurdoglu, BS, Jessica Aldrich, MSc, Christophe M. Legendre, MD, Kristi Stephenson, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Megan Russell, Austin Christofferson, Lori Cuyugan, Dan Rohrer, Alex Blanski, Meghan Hodges, Mmrf CoMMpass Network, Mary Derome, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, David Craig, PhD, John Carpten, PhD, Jonathan J. Keats, PhD
Genomic Data Commons by National Cancer Institute
"Molecular Predictors of Outcome and Drug Response in Multiple Myeloma: An Interim Analysis of the Mmrf CoMMpass Study" by Jonathan J Keats, PhD, Gil Speyer, Austin Christofferson, Christophe Legendre, PhD, Jessica Aldrich, Megan Russell, Lori Cuyugan, Jonathan Adkins, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, Ravi Vij, MD, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD, Robert M Rifkin, Norma C Gutierrez, MD PhD, Mmrf CoMMpass Network, Jennifer Yesil, MS, Mary Derome, MS, Seungchan Kim, PhD, Winnie Liang, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Daniel Auclair, PhD, Sagar Lonial, MD FACP

See 5 usage examples →

NIH NCBI Sequence Read Archive (SRA) on AWS

bam, cram, fastq, genetic, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-rel...

Usage examples

See 5 usage examples →

BUSCO Datasets

assembly, bacteria, bioinformatics, genomic, life sciences, metagenomics, open source software, protein, virus

Lineage datasets for use with the BUSCO software package. Each dataset contains HMM profiles for clade-specific, universal, single-copy marker genes. Datasets are available across the archaea, bacteria, eukaryota and virus domains. The repository also includes the data files needed for phylogenetic placement of an input assembly.

Usage examples

OrthoDB and BUSCO update - annotation of orthologs with wider sampling of genomes. by Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva
BUSCO - assessing genomic data quality and beyond. by Mosè Manni, Matthew R. Berkeley, Mathieu Seppey, Evgeny M. Zdobnov
BUSCO - from QC to gene prediction and phylogenomics by Matthew Berkeley
BUSCO Update - Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. by Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, Evgeny M Zdobnov

See 4 usage examples →

Basic Local Alignment Search Tool (BLAST) Databases

bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, transcriptomics

A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).

Usage examples

BLAST+ Docker by NCBI BLAST
BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang
BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden
Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman

See 4 usage examples →

Encyclopedia of DNA Elements (ENCODE)

bioinformatics, biology, genetic, genomic, life sciences

The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a ...

Usage examples

See 4 usage examples →

Epigenomes of the Human Pangenome Reference Consortium (HPRC) Release 2

bioinformatics, biology, epigenomics, genetic, genomic, life sciences

The Human Pangenome Reference Consortium (HPRC) Release 2 represents a landmark achievement in genomics, providing high-quality phased genome assemblies from over 200 individuals with comprehensive functional genomics data. The HPRC Epigenome Browser provides researchers a way to explore all epigenomics data generated by release 2. The HPRC Epigenome Browser (HPRCEB) is a modern, interactive web portal that democratizes access to HPRC Release 2 epigenomics data through an intuitive interface supporting genome selection, data visualization, and bulk download capabilities. The portal integrates ...

Usage examples

"Modbed track: Visualization of modified bases in single-molecule sequencing" by Daofeng Li, Xiaoyu Zhuo, Jessica K. Harrison, Shane Liu, Ting Wang
WashU Epigenome Browser update 2025 by Chanrung Seng, Shane Liu, Wenjin Zhang, Xiaoyu Zhuo, Daofeng Li, Ting Wang
A draft human pangenome reference by Liao, WW., Asri, M., Ebler, J. et al.
"Get To Know A Dataset: HPRC Epigenome" by HPRC Epigenome Browser

See 4 usage examples →

Genome in a Bottle on AWS

genetic, genomic, life sciences, reference index, vcf

Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up-to-date GIAB release.

Usage examples

GA4GH Benchmarking Tools by GA4GH Benchmarking Team
High-coverage, long-read sequencing of Han Chinese trio reference samples by Wang Y et al (2019)
The Genome in a Bottle Github Project by Genome In A Bottle Consortium
Extensive sequencing of seven human genomes to characterize benchmark reference materials by Zook J et al (2016)

See 4 usage examples →

Molecular Profiling to Predict Response to Treatment (phs001965)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Molecular Profiling to Predict Response to Treatment (MP2PRT) program is part of the NCI's Cancer Moonshot Initiative. The aim of this program is the retrospective characterization and analysis of biospecimens collected from completed NCI-sponsored trials of the National Clinical Trials Network and the NCI Community Oncology Research Program. This study, titled "Identification of Genetic Changes Associated with Relapse and/or Adaptive Resistance in Patients Registered as Favorable Histology Wilms Tumor on AREN03B2", performs genomic characterization (WGS 30X, Total RNAseq, mi...

Usage examples

Genetic changes associated with relapse in favorable histology Wilms tumor: A Children's Oncology Group AREN03B2 study by Samantha Gadd, Vicki Huff, et al.
Finding the way to Wilms tumor by comparing the primary and relapse tumor samples by Filippo Spreafico, Sara Ciceri, et al.
Childhood Cancer Data Initiative Data Catalog by National Cancer Institute
Genomic Data Commons by National Cancer Institute

See 4 usage examples →

Refgenie reference genome assets

bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing

Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.

Usage examples

See 4 usage examples →

The Impact of Genomic Variation on Function Consortium (IGVF)

bioinformatics, biology, genetic, genomic, life sciences

The IGVF (Impact of Genomic Variation on Function) Consortium aims to understand how genomic variation affects genome function, which in turn impacts phenotype. The NHGRI is funding this collaborative program that brings together teams of investigators who will use state-of-the-art experimental and computational approaches to model, predict, characterize and map genome function, how genome function shapes phenotype, and how these processes are affected by genomic variation. These joint efforts will produce a catalog of the impact of genomic variants on genome function and phenotypes.
The Data Corpus consists of single-cell genomics experiments (both single-modal and multimodal, typically snRNA-seq and snATAC-seq), characterization experiments using Massively Parallel Reporter Assays (MPRAs) and CRISPR screens along with a variety of protein mutation assays, and predictive models. IGVF stores a huge variety of files in the AWS Open Data set, so we recommend using the metadata file or browsing the IGVF D...

Usage examples

See 4 usage examples →

UK Biobank Linkage Disequilibrium Matrices

genetic, genome wide association study, genomic, life sciences, population genetics

Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.
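
As a rough illustration of what an LD matrix holds, the sketch below computes the pairwise Pearson correlation between variants' genotype dosages. It is a minimal sketch under stated assumptions, not the procedure used to build this dataset: the function names `ld_r` and `ld_matrix` and the toy dosage vectors are hypothetical, and whether the published matrices store r or r² is not stated in this description.

```python
from math import sqrt


def ld_r(dosages_a, dosages_b):
    """Pearson correlation between two variants' genotype dosages (hypothetical helper)."""
    n = len(dosages_a)
    if n == 0 or n != len(dosages_b):
        raise ValueError("dosage vectors must be non-empty and of equal length")
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        # A variant with no variation has no defined correlation; report 0 here.
        return 0.0
    return cov / sqrt(var_a * var_b)


def ld_matrix(variants):
    """Square matrix of pairwise correlations; one row and column per variant."""
    return [[ld_r(a, b) for b in variants] for a in variants]


# Toy dosages (0, 1 or 2 copies of the alternate allele) for four individuals.
toy_variants = [
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [0, 1, 2, 1],
]
toy_ld = ld_matrix(toy_variants)
```

Squaring each entry gives r², the form in which LD is often reported.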

Usage examples

PolyFun and PolyPred software by Omer Weissbrod
Functionally informed fine-mapping and polygenic localization of complex trait heritability by Weissbrod et al.
Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores by Weissbrod et al.
PolyFun Wiki by Omer Weissbrod

See 4 usage examples →

UK Biobank Pan-Ancestry Summary Statistics

genetic, genome wide association study, genomic, life sciences, population genetics

A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.
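
The leave-one-population-out idea can be sketched as follows. This is a minimal sketch that assumes fixed-effect inverse-variance weighting, which the description above does not specify; the function names and the population labels `POP_A`, `POP_B`, `POP_C` are hypothetical, and the effect sizes are made-up toy values rather than data from this resource.

```python
from math import sqrt


def ivw_meta(estimates):
    """Pool per-population (beta, standard error) pairs by inverse-variance weighting."""
    if not estimates:
        raise ValueError("need at least one population estimate")
    weights = {pop: 1.0 / (se * se) for pop, (_beta, se) in estimates.items()}
    total_weight = sum(weights.values())
    pooled_beta = sum(weights[pop] * estimates[pop][0] for pop in estimates) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


def leave_one_population_out(estimates):
    """For each population, pool the estimates of all the other populations."""
    if len(estimates) < 2:
        raise ValueError("need at least two populations to leave one out")
    return {
        left_out: ivw_meta({pop: val for pop, val in estimates.items() if pop != left_out})
        for left_out in estimates
    }


# Toy per-population effect sizes and standard errors for one variant and trait.
toy_estimates = {"POP_A": (0.1, 0.1), "POP_B": (0.3, 0.1), "POP_C": (0.2, 0.2)}
pooled_all = ivw_meta(toy_estimates)
pooled_without = leave_one_population_out(toy_estimates)
```

Comparing `pooled_all` with each entry of `pooled_without` shows how much a single population drives the pooled estimate.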

Usage examples

Hail by Hail Team
Hail Tutorials by Hail Team
Hail on AWS Quick Start by Amazon Web Services and PrivoIT
Pan-ancestry genetic analysis of the UK Biobank by Pan UKBB Team

See 4 usage examples →

Beat Acute Myeloid Leukemia (AML) 1.0

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES

Beat AML 1.0 is a collaborative research program involving 11 academic medical centers that worked collectively to better understand which drugs and drug combinations should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples, offering genomic, clinical, and drug response data. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...

Usage examples

Genomic Data Commons by National Cancer Institute
Functional Genomic Landscape of Acute Myeloid Leukemia by Jeffrey W. Tyner, Cristina E. Tognon, Dan Bottomly et al.
Clinical resistance to crenolanib in acute myeloid leukemia due to diverse molecular mechanisms by Zhang H, Savage S, Schultz AR, Bottomly D, White L, Segerdell E, et al.

See 3 usage examples →

Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma

cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.

Usage examples

Genomic Data Commons by National Cancer Institute
Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., Calvin A. Johnson, Ph.D., James D. Phelan, Ph.D., James Q. Wang, Ph.D., Sandrine Roulland, Ph.D., Monica Kasbekar, Ph.D., Ryan M. Young, Ph.D., Arthur L. Shaffer, Ph.D., Daniel J. Hodson, M.D., Ph.D., Wenming Xiao, Ph.D., et al.
A multiprotein supercomplex controlling oncogenic signalling in lymphoma by Phelan JD, Young RM, Webster DE, Roulland S, Wright GW, Kasbekar M, Shaffer AL 3rd, Ceribelli M, Wang JQ, Schmitz R, Nakagawa M, Bachy E, Huang DW, Ji Y, Chen L, Yang Y, Zhao H, Yu X, Xu W, Palisoc MM, Valadez RR, Davies-Hill T, Wilson WH, Chan WC, Jaffe ES, Gascoyne RD, Campo E, Rosenwald A, Ott G, Delabie J, Rimsza LM, Rodriguez FJ, Estephan F, Holdhoff M, Kruhlak MJ, Hewitt SM, Thomas CJ, Pittaluga S, Oellerich T, Staudt LM

See 3 usage examples →

Exceptional Responders Initiative

cancer, epigenomics, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 mon...

Usage examples

GDC Legacy Archive by National Cancer Institute
The Exceptional Responders Initiative: Feasibility of a National Cancer Institute Pilot Study by Barbara A. Conley, Lou Staudt, et al.
Genomic Data Commons by National Cancer Institute

See 3 usage examples →

Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)

cancer, genomic, life sciences

The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers were generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

Usage examples

High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis by Ryan J. Hartmaier, Lee A. Albacker, Juliann Chmielecki, Mark Bailey, Jie He, Michael E. Goldberg, Shakti Ramkissoon, James Suh, Julia A. Elvin, Samuel Chiacchia, Garrett M. Frampton, Jeffrey S. Ross, Vincent Miller, Philip J. Stephens and Doron Lipson
Genomic Data Commons by National Cancer Institute
Targeted next-generation sequencing of advanced prostate cancer identifies potential therapeutic targets and disease heterogeneity. by Beltran H, Yelensky R, Frampton GM, Park K, Downing SR, MacDonald TY, Jarosz M, Lipson D, Tagawa ST, Nanus DM, Stephens PJ, Mosquera JM, Cronin MT, Rubin MA

See 3 usage examples →

NASA Space Biology Open Science Data Repository (OSDR)

bioinformatics, biology, GeneLab, genomic, imaging, life sciences, space biology

NASA’s Space Biology Open Science Data Repository (OSDR) introduces a one-stop site where users can explore and contribute a variety of NASA open science biological data. This site consolidates data from the Ames Life Sciences Data Archive (ALSDA) and GeneLab and includes information about the broader NASA Open Science and Open Data initiatives, all at one centralized location. Our mission is to maximize the utilization of the valuable biological research resources and enable new discoveries.

OSDR introduces access to data generated from spaceflight and space-relevant experiments that explore ...

Usage examples

GeneLab: Omics database for spaceflight experiments by Shayoni Ray, Samrawit Gebre, Homer Fogle, Daniel C Berrios, Peter B Tran, Jonathan M Galazka, Sylvain V Costes
NASA GeneLab: interfaces for the exploration of space omics data by Daniel C Berrios, Jonathan Galazka, Kirill Grigorev, Samrawit Gebre, Sylvain V Costes
Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration by Ryan T. Scott, Kirill Grigorev, Graham Mackintosh, Samrawit G. Gebre, Christopher E. Mason, Martha E. Del Alto, Sylvain V. Costes

See 3 usage examples →

Open Human Genome Library

bioinformatics, biology, genomic, life sciences

The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.

Usage examples

AGC: compact representation of assembled genomes with fast queries and updates by Sebastian Deorowicz, Agnieszka Danek, Heng Li
Using OpenHGL data by Heng Li
BWT construction and search at the terabase scale by Heng Li

See 3 usage examples →

QIIME 2 Tutorial Data

bioinformatics, biology, ecosystems, environmental, genetic, genomic, health, life sciences, metagenomics, microbiome

QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

Usage examples

See 3 usage examples →

The Human Microbiome Project

amino acid, fasta, fastq, genetic, genomic, life sciences, metagenomics, microbiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Usage examples

Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.
New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K. Crossman, Richard Stahl, Peter Eipers, et al.
The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight & Jeffrey I. Gordon

See 3 usage examples →

Variant Effect Predictor (VEP) and the Loss-Of-Function Transcript Effect Estimator (LOFTEE) Plugin

genome wide association study, genomic, life sciences, loftee, vep

VEP determines the effect of genetic variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. The European Bioinformatics Institute produces the VEP tool and database and releases updates every 1 to 6 months. The latest release contains 267 genomes from 232 species containing 5,567,663 protein-coding genes. This dataset hosts the last 5 releases for human, rat, and zebrafish. It also hosts the reference files required by the Loss-Of-Function Transcript Effect Estimator (LOFTEE) plugin, which is commonly used with VEP.

Usage examples

See 3 usage examples →

1000 Genomes

fastq, genetic, genomic, life sciences, whole genome sequencing

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.

Usage examples

Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR by Alyssa Marrow
Examine genomic variation across populations with AWS by Konstantinos Tzouvanas

See 2 usage examples →

4D Nucleome (4DN)

bioinformatics, biology, genetic, genomic, imaging, life sciences

The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...

Usage examples

See 2 usage examples →

Broad Genome References

bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index

Human genome reference builds hg19 and hg38, plus decoy references, maintained by the Broad Institute.

Usage examples

Advancing NGS quality control to enable measurement of actionable mutations in circulating tumor DNA by Willey J. C., Morrison T. B., Austermiller B., Crawford E. E., et al (2021)
Using Amazon FSx for Lustre for Genomics Workflows on AWS by W. Lee Pang

See 2 usage examples →

Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer

cancer, genomic, life sciences, STRIDES, transcriptomics

The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Usage examples

Genomic Data Commons by National Cancer Institute
Genome-wide discovery of somatic coding and noncoding mutations in pediatric endemic and sporadic Burkitt lymphoma by Grande B. M., Gerhard D. S., Jiang A., Griner N. B., Abramson J. S., Alexander T. B., et al.

See 2 usage examples →

Cloud Indexes for Bowtie, Kraken, HISAT, and Centrifuge

bioinformatics, biology, genomic, life sciences, mapping, medicine, reference index, whole genome sequencing

Genomic tools use reference databases as indexes to operate quickly and efficiently, analogous to how web search engines use indexes for fast querying. Here, we aggregate genomic, pan-genomic and metagenomic indexes for analysis of sequencing data.
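
To make the search-engine analogy concrete, the sketch below builds a toy k-mer index over a reference string and uses it to locate a read without scanning the whole reference. It is only an illustration under stated assumptions: a plain hash-based index is swapped in for simplicity, whereas the tools named in the title use their own, far more compact index structures, and the function names and the toy sequence are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_exact(read, reference, index, k):
    """Look up the read's first k-mer in the index, then verify each candidate position."""
    if len(read) < k:
        raise ValueError("read must be at least k bases long")
    hits = []
    for pos in index.get(read[:k], []):
        # Only candidate positions are compared against the full read.
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


toy_reference = "ACGTACGTTAGC"
toy_index = build_kmer_index(toy_reference, k=4)
toy_hits = find_exact("ACGTT", toy_reference, toy_index, k=4)
```

The index is built once and reused for every read, which is why distributing pre-built indexes saves users the construction cost.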

Usage examples

Reducing reference bias using multiple population reference genomes by Chen et al (2020)
Table of contents for tutorials for constituent tools by Ben Langmead

See 2 usage examples →

DNAstack COVID19 SRA Data

bam, bioinformatics, coronavirus, COVID-19, fasta, fastq, genetic, genomic, global, health, life sciences, long read sequencing, SARS-CoV-2, vcf, virus, whole genome sequencing

The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...

Usage examples

Viral lineage assignment by Heather Ward
Viral AI by DNAstack

See 2 usage examples →

GATK Structural Variation (SV) Data

bioinformatics, biology, cromwell, gatk-sv, genetic, genomic, life sciences, structural variation

This dataset holds the data needed to run a structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data in AWS.

Usage examples

AWS Setup & Execution by Goldfinch Bio and Loka Inc.
Structural Variant Analysis on AWS with Amazon FSx for Lustre by Goldfinch Bio and Loka Inc.

See 2 usage examples →

Genomic Characterization of Metastatic Castration Resistant Prostate Cancer

cancer, genomic, life sciences, STRIDES, whole genome sequencing

Biopsies of castration resistant prostate cancer metastases were subjected to whole genome sequencing (WGS), along with RNA-sequencing (RNA-Seq). The overarching goal of the study is to illuminate molecular mechanisms of acquired resistance to therapeutic agents, and particularly androgen signaling inhibitors, in the treatment of metastatic castration resistant prostate cancer (mCRPC). This study is made available on AWS via the NIH STRIDES Initiative.

Usage examples

Genomic characterization of metastatic castration-resistant prostate cancer patients undergoing PSMA radioligand therapy: A single-center experience by Swayamjeet Satapathy, Chandan K Das, et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

Hecatomb Databases

bioinformatics, genetic, genomic, life sciences, metagenomics, virus, whole genome sequencing

Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.

Usage examples

See 2 usage examples →

Human Cell Atlas

biology, cell biology, cell imaging, gene expression, genome, genomic, Homo sapiens, life sciences, Mus musculus, single-cell transcriptomics, transcriptomics

The Human Cell Atlas (HCA) is a collaborative community of international scientists. Our mission is to create comprehensive reference maps of all the cells in the human body as a basis for both understanding human health and diagnosing, monitoring, and treating disease. The HCA registry has more than one thousand member scientists from hundreds of institutions around the world. The project is steered and governed by an Organizing Committee, co-chaired by Aviv Regev and Sarah Teichmann.

Usage examples

The Human Cell Atlas: towards a first draft atlas by Various authors
The network effect: studying COVID-19 pathology with the Human Cell Atlas by Sarah Teichmann, Aviv Regev
The Human Cell Atlas: towards a first draft atlas by Various authors
The Human Cell Atlas White Paper by Aviv Regev, Sarah Teichmann, Orit Rozenblatt-Rosen, Michael Stubbington, Kristin Ardlie, Ido Amit, Paola Arlotta, Gary Bader, Christophe Benoist, Moshe Biton, Bernd Bodenmiller, Benoit Bruneau, Peter Campbell, Mary Carmichael, Piero Carninci, Leslie Castelo-Soccio, Menna Clatworthy, Hans Clevers, Christian Conrad, Roland Eils, Jeremy Freeman, Lars Fugger, Berthold Goettgens, Daniel Graham, Anna Greka, Nir Hacohen, Muzlifah Haniffa, Ingo Helbig, Robert Heuckeroth, Sekar Kathiresan, Seung Kim, Allon Klein, Bartha Knoppers, Arnold Kriegstein, Eric Lander, Jane Lee, Ed Lein, Sten Linnarsson, Evan Macosko, Sonya MacParland, Robert Majovski, Partha Majumder, John Marioni, Ian McGilvray, Miriam Merad, Musa Mhlanga, Shalin Naik, Martijn Nawijn, Garry Nolan, Benedict Paten, Dana Pe'er, Anthony Philippakis, Chris Ponting, Steve Quake, Jayaraj Rajagopal, Nikolaus Rajewsky, Wolf Reik, Jennifer Rood, Kourosh Saeb-Parsy, Herbert Schiller, Steve Scott, Alex Shalek, Ehud Shapiro, Jay Shin, Kenneth Skeldon, Michael Stratton, Jenna Streicher, Henk Stunnenberg, Kai Tan, Deanne Taylor, Adrian Thorogood, Ludovic Vallier, Alexander van Oudenaarden, Fiona Watt, Wilko Weicher, Jonathan Weissman, Andrew Wells, Barbara Wold, Ramnik Xavier, Xiaowei Zhuang, Human Cell Atlas Organizing Committee
The Human Cell Atlas from a cell census to a unified foundation model by Jennifer E. Rood, Samantha Wynne, Lucia Robson, Anna Hupalowska, John Randell, Sarah A. Teichmann & Aviv Regev

See 5 usage examples →

Indexes for Kaiju

bioinformatics, biology, genomic, life sciences, metagenomics, microbiome, reference index, whole genome sequencing

This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.

Usage examples

Quickstart Tutorial for downloading the index files and running Kaiju by Peter Menzel
Fast and sensitive taxonomic classification for metagenomics with Kaiju by Peter Menzel et al (2016)

See 2 usage examples →

Integrative Analysis of Lung Adenocarcinoma in Environment and Genetics Lung cancer Etiology (Phase 2)

cancer, epigenomics, genomic, life sciences, STRIDES, whole exome sequencing, whole genome sequencing

We performed whole genome sequencing and whole exome sequencing of 31 lung adenocarcinoma (LUAD) samples from the Environment And Genetics in Lung cancer Etiology (EAGLE) study. The EAGLE study is made available on AWS via the NIH STRIDES Initiative (https://aws.amazon.com/blogs/publicsector/aws-and-national-institutes-of-health-collaborate-to-accelerate-discoveries-with-strides-initiative/).

Usage examples

See 2 usage examples →

National Cancer Institute Center for Cancer Research - Diffuse Large B Cell Lymphoma (DLBCL) Genomics and Expression

cancer, genomic, life sciences

The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas (DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification data.

Usage examples

Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

ONT Methylation Benchmarking Datasets

bam, benchmark, bioinformatics, epigenomics, genomic, life sciences, long read sequencing

The ONT Methylation Benchmarking Datasets were generated to benchmark existing methylation-calling tools on the Oxford Nanopore sequencing platform using its recent R10.4.1 flowcell chemistry. They span a diverse range of species, including bacteria (E. coli, H. pylori J99, H. pylori 26695, A. variabilis, T. denticola), plants (rice, Arabidopsis), and mammals (mouse, human). In addition, the dataset includes EMSeq data for E. coli, plant, and mouse samples, which can serve as ground truth for methylation studies. It also provides unmethylated whole-genome amplified (WGA) DNA for H. pylori 26695 and...

Usage examples

Methylation calling using ONT methylation benchmarking dataset by Onkar Kulkarni
Comprehensive benchmarking of tools for nanopore-based detection of DNA methylation by Kulkarni et al.

See 2 usage examples →

OpenCRAVAT

genetic, genomic, life sciences, sqlite, tertiary analysis, variant annotation

OpenCRAVAT is a modular variant annotation tool developed by KarchinLab at Johns Hopkins. This dataset is a mirror of the OpenCRAVAT store available at https://store.opencravat.org. You can configure OpenCRAVAT to use this mirror by editing the "cravat-system.yml" file. The path to this file is printed on the first output line of the command "oc config system". In that file, change the value of "store_url" to "https://opencravat-store-aws.s3.amazonaws.com".
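The store_url change described above can be sketched in Python. This is only an illustrative sketch: it assumes that "store_url" appears as a plain top-level key in cravat-system.yml, and the helper name set_store_url and the sample modules_dir line are hypothetical, introduced here for demonstration. It edits the configuration text only; reading and writing the real file (whose path "oc config system" reports) is left to the user.

```python
import re

# Mirror URL quoted in the dataset description.
MIRROR_URL = "https://opencravat-store-aws.s3.amazonaws.com"


def set_store_url(config_text: str, new_url: str = MIRROR_URL) -> str:
    """Return config_text with its top-level store_url value set to new_url.

    Assumption: store_url is a simple "key: value" line at the top level
    of the YAML file. If the key is missing, a new line is appended.
    """
    pattern = re.compile(r"^store_url\s*:.*$", flags=re.MULTILINE)
    new_line = f"store_url: {new_url}"
    if pattern.search(config_text):
        # Replace only the first matching line; a function replacement
        # avoids any special handling of backslashes in the URL.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    # Key not present: append it, adding a newline separator if needed.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{new_line}\n"


# Hypothetical file contents, for illustration only.
example = (
    "modules_dir: /hypothetical/path/to/modules\n"
    "store_url: https://store.opencravat.org\n"
)
updated = set_store_url(example)
print(updated)
```

Working on the text rather than parsing YAML keeps the sketch free of third-party dependencies and leaves every other line of the file untouched. A YAML library would be the more robust choice if the file uses nested keys or quoting.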

Usage examples

OpenCRAVAT by Karchinlab
Changing the OpenCRAVAT store url by Kyle Moad

See 2 usage examples →

Oregon Health & Science University Chronic Neutrophilic Leukemia Dataset

cancer, genomic, life sciences

The OHSU-CNL study offers whole exome and RNA sequencing data on a cohort of 100 cases with rare hematologic malignancies such as chronic neutrophilic leukemia (CNL), atypical chronic myeloid leukemia (aCML), and unclassified myelodysplastic syndrome/myeloproliferative neoplasms (MDS/MPN-U). This dataset contains open RNA-Seq Gene Expression Quantification data.

Usage examples

Genomic landscape of neutrophilic leukemias of ambiguous diagnosis by Zhang H, Wilmot B, Bottomly D et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

Pancreatic Cancer Organoid Profiling

cancer, genetic, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS), and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.

Usage examples

Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer by Tiriac H, Belleau P, Engle DD, Plenker D, Deschênes A, Somerville TD, et al.
Genomic Data Commons by National Cancer Institute

See 2 usage examples →

RNA structure by fragmentation frequency

bioinformatics, genomic, life sciences, transcriptomics

The fragSTRUC project develops software to extract RNA secondary structure information from Illumina datasets. The approach relies on divalent ions in standard RNA-seq library preparation fragmenting sequences at non-base-paired regions of RNA.

Usage examples

Accessing the fragSTRUC dataset on AWS by Yuk Kei Wan and Leonard Schärfen
fragSTRUC: RNA structure by fragmentation frequency by Yuk Kei Wan and Leonard Schärfen

See 2 usage examples →

Somatic Mosaicism across Human Tissues (SMaHT)

bam, bioinformatics, biology, genetic, genomic, imaging, life sciences, whole genome sequencing

The Somatic Mosaicism across Human Tissues (SMaHT) project is an NIH Common Fund consortium (2023-) that aims to comprehensively characterize somatic variation ("mosaicism") in normal human tissues. While most genetic studies have relied on blood-derived DNA, SMaHT captures the full spectrum of DNA variation across cell types, tissues, and organs from phenotypically normal individuals to better understand the role of somatic mosaicism in human development, aging, and disease progression. Researchers in the consortium develop and apply experimental and computational methods, paired with th...

Usage examples

The Somatic Mosaicism across Human Tissues Network by Coorens T, Oh J, Choi Y, Lim N, Zhao B, Voshall A et al.
Somatic Mosaicism across Human Tissues Data Portal by SMaHT Data Analysis Center (DAC)

See 2 usage examples →

Tabula Sapiens

biology, encyclopedic, genetic, genomic, health, life sciences, medicine, single-cell transcriptomics

Tabula Sapiens is a benchmark, first-draft human cell atlas of over 1.1M cells from 28 organs of 24 normal human subjects. This work is the product of the Tabula Sapiens Consortium. Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects, and allows detailed analysis and comparison of cell types that are shared between tissues. Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments. We...

Usage examples

The Tabula Sapiens: a multiple organ single cell transcriptomic atlas of humans by The Tabula Sapiens Consortium
Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues by The Tabula Sapiens Consortium, Stephen R Quake

See 2 usage examples →

Aging Mouse Brain Epigenetic

bam, cram, fastq, genetic, genomic, life sciences, transcriptomics, whole exome sequencing, whole genome sequencing

Aging is a major risk factor for neurodegenerative diseases, yet underlying epigenetic mechanisms remain unclear. Here, we generated a comprehensive single-nucleus cell atlas of brain aging across multiple brain regions, comprising 132,551 single-cell methylomes and 72,666 joint chromatin conformation-methylome nuclei. Integration with companion transcriptomic and chromatin accessibility data yielded a cross-modality taxonomy of 36 major cell types.

Usage examples

Cell-type-specific transposable element demethylation and TAD remodeling in the aging mouse brain by Zeng, Q., Wei, T., Klein, A., Bartlett, A., Liu, H., Nery, J.R., Castanon, R., Osteen, J., Johnson, N.D., Wang, W., Ding, W., Chen, H., Altshul, J., Kenworthy, M., Valadon, C., Owens, W., Wu, Z., Amaral, M.L., Song, Báez-Becerra, Tatiana, Cho, S., Chen, C., Willier, J., Cao, S., Rink, J., Lee, J., Barcoma, A., Arzavala, J., Emerson, N., Lu, Y.R., Ren, B., Behrens, Margarita, Ecker, J.R.

See 1 usage example →

COVID-19 Genome Sequence Dataset

bam, bioinformatics, biology, coronavirus, COVID-19, cram, fastq, genetic, genomic, health, life sciences, MERS, SARS, STRIDES, transcriptomics, virus, whole genome sequencing

This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six ho...

Usage examples

Download SRA sequence data using Amazon Web Services (AWS) by NCBI SRA

See 1 usage example →

Human Cancer Models Initiative (HCMI) Cancer Model Development Center

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...

Usage examples

Genomic Data Commons by National Cancer Institute

See 1 usage example →

Human PanGenomics Project

cram, fast5, fastq, genetic, genomic, life sciences

This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.

Usage examples

Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes by Shafin et al (2020)

See 1 usage example →

Oxford Nanopore Technologies Benchmark Datasets

bioinformatics, biology, fast5, fastq, genomic, Homo sapiens, life sciences, whole genome sequencing

The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; and 3) development of tools and methods. The deposited data showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.

Usage examples

ONT Dataset Tutorials by EPI2MELabs

See 1 usage example →

Synthea Coherent Data Set

bioinformatics, csv, dicom, genomic, health, imaging, life sciences, medicine

This is a synthetic data set that includes FHIR resources, DICOM images, genomic data, physiological data (i.e., ECGs), and simple clinical notes. FHIR links all the data types together.

Usage examples

The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive Synthetic Health Record by Walonoski J, Hall D, Bates KM, Farris MH, Dagher J, Downs ME, Sivek RT, Wellner B, Gregorowicz A, Hadley M, Campion FX, Levine L, Wacome K, Emmer G, Kemmer A, Malik M, Hughes J, Granger E, Russell S.

See 1 usage example →

Tabula Muris

biology, encyclopedic, genomic, health, life sciences, medicine

Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...

Usage examples

Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris by Tabula Muris Consortium (2019)

See 1 usage example →

Tabula Muris Senis

biology, encyclopedic, genomic, health, life sciences, medicine, single-cell transcriptomics

Tabula Muris Senis is a comprehensive compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell type specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populat...

Usage examples

Fast queries of scRNAseq datasets with Amazon Athena by Andrew Ang, James Golden, Lisa McFerrin, and Lee Pang

See 1 usage example →

iHART Whole Genome Sequencing Data Set

autism spectrum disorder, bam, genetic, genomic, life sciences, vcf, whole genome sequencing

iHART is the Hartwell Foundation’s Autism Research and Technology Initiative. This release contains whole genome data from over 1,000 families with two or more children with autism, for which biomaterials were provided by the Autism Genetic Resource Exchange (AGRE).

Usage examples

Inherited and De Novo Genetic Risk for Autism Impacts Shared Networks by Ruzzo et al. (2020)

See 1 usage example →

recount3

bioinformatics, biology, cancer, csv, gene expression, genetic, genomic, Homo sapiens, life sciences, Mus musculus, neuroscience, transcriptomics

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts, as well as coverage bigWig files, for 8,679 human and 10,088 mouse studies. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.

Usage examples

recount3 quick start guide by Leonardo Collado-Torres

See 1 usage example →

Australasian Genomes

biodiversity, biology, conservation, genetic, genomic, life sciences, transcriptomics, wildlife

Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI) and the ARC Centre for Innovations in Peptide and Protein Science (CIPPS). This repository contains reference genomes, transcriptomes, resequenced genomes and reduced representation sequencing data from Australasian species. Australasian Genomes is managed by the Australasian Wildlife Genomics Group (AWGG) at the University of Sydney on behalf of our collaborators within TSI and CIPPS.

CartoStore

bioinformatics, genomic, life sciences, spatial omics, spatial transcriptomics

Cross-Platform Repository for High-resolution Spatial Transcriptomics Datasets.

Usage examples

CartoStore Overview by Hyun Min Kang and Weiqiu Cheng
Cartloader Documentation by Hyun Min Kang and Weiqiu Cheng
Example CartoStore Repository for Xenium Breast Cancer Dataset by Hyun Min Kang and Weiqiu Cheng

See 3 usage examples →

GATK Test Data

bioinformatics, biology, cancer, genetic, genomic, life sciences

The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute's Genome Analysis Toolkit (GATK).

Genome Ark

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.

MetaGraph Sequence Indexes

analysis ready data, biodiversity, bioinformatics, biology, fasta, genome, genomic, graph, information retrieval, life sciences, medicine, metagenomics, microbiome, transcriptomics, whole exome sequencing, whole genome sequencing

The MetaGraph Sequence Indexes dataset comprises full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). All index files can be used with the MetaGraph framework for sequence search. Indexes can be jointly used for aggregated search in the cloud or can be individually downloaded...

Usage examples

Usage within AWS by Oleksandr Kulkov
CloudFormation stack with a Step Function for dataset queries via AWS Batch by Oleksandr Kulkov
A global metagenomic map of urban microbiomes and antimicrobial resistance by Danko D, Bezdan D, Afshin EE, Ahsanuddin S, Bhattacharya C, Butler DJ, Chng KE, Donnellan D, Hecht J, Jackson K, Kuchin K, Karasikov M, Lyons A, Mak L, Meleshko D, Mustafa H, et al.

See 3 usage examples →

Metagenomic reference libraries for Slacken

bioinformatics, biology, genomic, life sciences, metagenomics, microbiome

Metagenomic indexes for use with the Slacken taxonomic classification tool.

Usage examples

Precise and scalable metagenomic profiling with sample-tailored minimizer libraries by Johan Nyström-Persson, Nishad Bapatdhar and Samik Ghosh
Classifying metagenomic samples on AWS ElasticMapReduce by Johan Nyström-Persson
Slacken by Johan Nyström-Persson, Nishad Bapatdhar

See 3 usage examples →

Nanopore Reference Human Genome

genetic, genomic, life sciences, whole genome sequencing

This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.

SocialGene RefSeq Databases

amino acid, bioinformatics, chemical biology, genomic, graph, metagenomics, microbiome, pharmaceutical, protein

Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.

Usage examples

See 3 usage examples →

The Genome Modeling System

genetic, genomic, life sciences

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

UCSC Genome Browser Sequence and Annotations

bioinformatics, biology, genetic, genomic, life sciences

The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-separated format, and of all binary files in custom formats, sometimes referred to as 'gbdb' files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome...

University of British Columbia Sunflower Genome Dataset

agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing

This dataset captures the genetic diversity of sunflowers, drawn from thousands of wild, cultivated, and landrace individuals distributed across North America. The data consist of raw sequences and associated botanical metadata, aligned sequences (to three different reference genomes), and sets of SNPs computed across several cohorts.

GenomeKit genomic data

bioinformatics, genome, genomic, Homo sapiens, life sciences, Mus musculus, non-human primate, open source software, Rattus norvegicus, variant annotation

GenomeKit is Deep Genomics’ Python library for fast and easy access to genomic resources such as sequence, data tracks, and annotations. The goal is to let machine learning researchers build data sets easily, and to be creative about how those data sets are designed. Out of the box, GenomeKit provides access to pre-built optimized genomic data files that are required for its operation.

Usage examples

See 2 usage examples →

Platinum Pedigree

bioinformatics, genomic, genotyping, Homo sapiens, life sciences, long read sequencing, whole genome sequencing

The Platinum Pedigree Consortium (PCC) is a collaborative project to create a comprehensive reference for human genetic variation using a four-generation, 28-member family (CEPH-1463). We employed five different short and long-read sequencing technologies to generate phased assemblies and characterize both inherited and de novo variation, including at some of the most difficult to genotype genomic regions such as tandem repeats, centromeres, and the Y chromosome. This extensive "truth set" is publicly available and can be used to test and benchmark new algorithms and technologies to ...

Usage examples

See 2 usage examples →

1KG-ONT-VIENNA panel

fast5, fastq, genetic, genomic, life sciences, whole genome sequencing

The 1KG-ONT-VIENNA panel comprises medium-coverage ONT sequencing data for 1,019 samples from the 1000 Genomes Project collection, structural variants, and their haplotype context.

Usage examples

Long-read sequencing and structural variant characterization in 1,019 samples from the 1000 Genomes Project by Siegfried Schloissnig, Samarendra Pani, Bernardo Rodriguez-Martin, Jana Ebler, Carsten Hain, Vasiliki Tsapalou, Arda Söylev, Patrick Hüther, Hufsah Ashraf, Timofey Prodanov, Mila Asparuhova, Sarah Hunt, Tobias Rausch, Tobias Marschall, Jan O Korbel

See 1 usage example →

AWS iGenomes

agriculture, amazon.science, biology, Caenorhabditis elegans, Danio rerio, genetic, genomic, Homo sapiens, life sciences, Mus musculus, Rattus norvegicus, reference index

Common reference genomes hosted on AWS S3. They can be used when aligning and analysing raw DNA sequencing data.

Usage examples

nf-core analysis pipelines by Phil Ewels

See 1 usage example →

AllTheBacteria

assembly, bacteria, bioinformatics, fasta, genomic, life sciences, microbial genomics, short read sequencing, whole genome sequencing

All bacterial isolate whole-genome sequencing data from INSDC, uniformly assembled, quality-controlled, annotated, and searchable.

Usage examples

AllTheBacteria - all bacterial genomes assembled, available and searchable by Hunt M, Lima L, Anderson D, Hawkey J, Shen W, Lees J, Iqbal I

See 1 usage example →

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

amazon.science, bioinformatics, fastq, genetic, genomic, life sciences, long read sequencing, short read sequencing, whole exome sequencing, whole genome sequencing

To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets, using different sequencing instruments, sequencing modalities (Illumina short read and Pacific Biosciences long read), and sample preparation protocols, for both whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.

Usage examples

An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development by Baid G., Nattestad M., Kolesnikov A., Goel S., Yang H., Chang P., and Carroll A (2020)

See 1 usage example →

OceanOmics

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

Minderoo Foundation OceanOmics aims to establish environmental DNA (eDNA) as a tool to measure, understand, and protect oceans. OceanOmics mainly generates two types of data: eDNA sequencing data (metabarcoding, metagenomics), and genome assembly data (marine vertebrates).

Usage examples

Case-studies on using OceanOmics genomes and eDNA data by Philipp Bayer

See 1 usage example →