This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.
You are currently viewing a subset of data tagged with cancer.
If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.
Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.
If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...
cancer, genomic, life sciences, STRIDES, whole genome sequencing
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...
cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing
The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...
bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle
The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad Institute. A human-friendly listing of datasets, instructions for accessing them, and other documentation is at the corresponding GitHub page abou...
cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES, transcriptomics, whole genome sequencing
The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.
cancer, genetic, genomic, life sciences, vcf
Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...
cancer, genomic, life sciences, STRIDES, transcriptomics
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is the Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.
bam, cancer, genetic, genomic, life sciences, vcf
The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.
cancer, genomic, life sciences, STRIDES, transcriptomics
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is the Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.
cancer, computational pathology, computer vision, deep learning, grand-challenge.org, histopathology, life sciences
This dataset contains all the data for the CAncer MEtastases in LYmph nOdes challeNge or CAMELYON. CAMELYON was the first challenge using whole-slide images in computational pathology and aimed to help pathologists identify breast cancer metastases in sentinel lymph nodes. Lymph node metastases are extremely important to find, as they indicate that the cancer is no longer localized and systemic treatment might be warranted. Searching for these metastases in H&E-stained tissue is difficult and time-consuming and AI algorithms can play a role in helping make this faster and more accura...
cancer, genetic, genomic, life sciences, STRIDES, whole genome sequencing
The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...
cancer, classification, computational pathology, computer vision, deep learning, digital pathology, grand-challenge.org, histopathology, imaging, life sciences, machine learning, medical image computing, medical imaging
This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Molecular Profiling to Predict Response to Treatment (MP2PRT) program is part of the NCI's Cancer Moonshot Initiative. The aim of this program is the retrospective characterization and analysis of biospecimens collected from completed NCI-sponsored trials of the National Clinical Trials Network and the NCI Community Oncology Research Program. This study, titled "Identification of Genetic Changes Associated with Relapse and/or Adaptive Resistance in Patients Registered as Favorable Histology Wilms Tumor on AREN03B2", performs genomic characterization (WGS 30X, Total RNAseq, mi...
biology, cancer, computer vision, gene expression, genetic, glioblastoma, Homo sapiens, image processing, imaging, life sciences, machine learning, neurobiology
This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.
cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES
Beat AML 1.0 is a collaborative research program involving 11 academic medical centers who worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples offering genomic, clinical, and drug response. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...
bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index
Broad-maintained human genome reference builds (hg19 and hg38) and decoy references.
cancer, computational pathology, computer vision, deep learning, histopathology, life sciences
This page describes the COBRA (Classification Of Basal cell carcinoma, Risky skin cancers and Abnormalities) skin pathology dataset, which comprises over 7000 histopathology whole-slide-images related to the diagnosis of basal cell carcinoma skin cancer, the most commonly diagnosed cancer. The dataset includes biopsies and excisions and is divided into four groups. The first group contains about 2,500 BCC biopsies with subtype labels, while the second group includes 2,500 non-BCC biopsies with different types of skin dysplasia. The third group has 1,000 labelled risky cancer biopsies, includin...
cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing
The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) used whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.
cancer, epigenomics, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing
The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 mon...
cancer, genomic, life sciences
The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.
cancer, digital pathology, fluorescence imaging, image processing, imaging, life sciences, machine learning, microscopy, radiology
Imaging Data Commons (IDC) is a repository within the Cancer Research Data Commons (CRDC) that manages imaging data and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections contributed either by funded US National Cancer Institute (NCI) data collection activities or by individual researchers. Image data hosted by IDC is stored in DICOM format.
cancer, genomic, life sciences, STRIDES, transcriptomics
The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...
bias, biology, cancer, health, imaging, life sciences, mammography, x-ray
EMBED is a racially diverse mammography dataset containing 3.4M screening and diagnostic images from 110,000 patients collected from 2013-2020, with an equal representation of black and white women. The dataset comprises 2D, synthetic 2D (C-view), and 3D (digital breast tomosynthesis, i.e., DBT) images. It contains 60,000 annotated lesions linked to structured imaging descriptors and ground truth pathologic outcomes grouped into six severity classes. This release represents 20% of the total 2D and C-view dataset and is available for research use. DBT, US, and MRI exams will be added at a ...
cancer, genomic, life sciences, STRIDES, whole genome sequencing
Biopsies of castration-resistant prostate cancer metastases were subjected to whole genome sequencing (WGS), along with RNA-sequencing (RNA-Seq). The overarching goal of the study is to illuminate molecular mechanisms of acquired resistance to therapeutic agents, particularly androgen signaling inhibitors, in the treatment of metastatic castration-resistant prostate cancer (mCRPC). This study is made available on AWS via the NIH STRIDES Initiative.
cancer, epigenomics, genomic, life sciences, STRIDES, whole exome sequencing, whole genome sequencing
We performed whole genome sequencing and whole exome sequencing of 31 lung adenocarcinoma (LUAD) samples from the Environment And Genetics in Lung cancer Etiology (EAGLE) study. The EAGLE study is made available on AWS via the NIH STRIDES Initiative (https://aws.amazon.com/blogs/publicsector/aws-and-national-institutes-of-health-collaborate-to-accelerate-discoveries-with-strides-initiative/).
cancer, computational pathology, computer vision, deep learning, grand-challenge.org, histopathology, life sciences
This dataset contains all the data for the LEarning biOchemical Prostate cAncer Recurrence from histopathology sliDes challenge or LEOPARD. Prostate cancer, impacting 1.4 million men annually, is a prevalent malignancy (H. Sung et al., 2021). A substantial number of these individuals undergo prostatectomy as the primary curative treatment. The efficacy of this surgery is assessed, in part, by monitoring the concentration of prostate-specific antigen (PSA) in the bloodstream. While the role of PSA in prostate cancer screening is debatable (W. F. Clark et al., 2018; E. A. M. Heijnsdijk et al., 2018), it serves as a valuable biomarker for postprostatectomy follow-up in patients. Following successful surgery, PSA concentration is typically undetectable (<0.1 ng/mL) within 4-6 weeks (S. S. Goonewardene et al., 2014). However, approximately 30% of patients experience biochemical recurrence, signifying the resurgence of prostate cancer cells. This recurrence serves as a prognostic indicator for progression to clinical metastases and eventual prostate cancer-related mortality (C. L. Amling, 2014; S. J. Freedland et al., 2005; M. Han et al., 2001; T. Van den Broeck et al., 2001). Current clinical practices gauge the risk of biochemical recurrence by considering the International Society of Urological Pathology (ISUP) grade, PSA value at diagnosis, and TNM staging criteria (J. I. Epstein et al., 2016). A recent European consensus guideline suggests categorizing patients into low-risk, intermediate-risk, and high-risk groups based on these factors (N. Mottet et al., 2021). Notably, a high ISUP grade independently assigns a patient to the intermediate (grade 2/3) or high-risk group (grade 4/5). The Gleason growth patterns, representing morphological patterns of prostate cancer, are used to categorize cancerous tissue into ISUP grade groups (J. I. Epstein, 2010; P. M. Pierorazio et al., 2013; G. J. L. H. van Leenders et al., 2020; J. I. Epstein et al., 2016).
However, the ISUP grade has limitations, such as grading disagreement among pathologists (J. I. Epstein et al., 2016) and coarse descriptors of tissue morphology. Recently, deep learning was shown (H. Pinckaers et al., 2022; O. Eminaga et al., 2024)...
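The grade-to-risk-group rule stated in the LEOPARD description (ISUP grade 2/3 implies at least intermediate risk, grade 4/5 implies high risk) can be sketched in Python as follows. This is only an illustration of that one sentence: the function name is hypothetical, PSA value and TNM stage (which the description also names as factors) are not modeled, and the sketch is not part of the dataset or the challenge tooling.

```python
from typing import Optional


def risk_group_from_isup_alone(isup_grade: int) -> Optional[str]:
    """Return the risk group that the ISUP grade implies on its own.

    Hypothetical helper: it encodes only the statement that grade 2/3
    assigns a patient to the intermediate group and grade 4/5 to the
    high-risk group. For grade 1 the grade alone does not decide, so
    None is returned (PSA and TNM staging would be needed).
    """
    if isup_grade not in range(1, 6):
        raise ValueError("ISUP grade must be an integer from 1 to 5")
    if isup_grade >= 4:
        return "high"
    if isup_grade >= 2:
        return "intermediate"
    return None  # grade 1: not determined by grade alone


if __name__ == "__main__":
    for grade in range(1, 6):
        print(grade, risk_group_from_isup_alone(grade))
```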
cancer, genomic, life sciences
The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas (DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification data.
cancer, genomic, life sciences
The OHSU-CNL study offers whole exome and RNA sequencing of a cohort of 100 cases with rare hematologic malignancies such as chronic neutrophilic leukemia (CNL), atypical chronic myeloid leukemia (aCML), and unclassified myelodysplastic syndrome/myeloproliferative neoplasms (MDS/MPN-U). This dataset contains open RNA-Seq Gene Expression Quantification data.
cancer, genetic, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing
This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS), and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.
cancer, computational pathology, computer vision, deep learning, grand-challenge.org, histopathology, life sciences
This dataset contains the training data for the Tumor InfiltratinG lymphocytes in breast cancER or TIGER challenge. TIGER is the first challenge on fully automated assessment of tumor-infiltrating lymphocytes (TILs) in breast cancer histopathology slides. TILs are proving to be an important biomarker in cancer patients as they can play a part in killing tumor cells, particularly in some types of breast cancer. Identifying and measuring TILs can help to better target treatments, particularly immunotherapy, and may result in lower levels of other more aggressive treatments, including chemo...
bioinformatics, biology, cancer, genetic, genomic, life sciences
The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute's Genome Analysis Toolkit (GATK).
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...
biology, cancer, computer vision, health, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, medical imaging, medicine, neurobiology, neuroimaging, segmentation
This dataset contains 8,000+ brain MRIs of 2,000+ patients with brain metastases.
bioinformatics, biology, cancer, csv, gene expression, genetic, genomic, Homo sapiens, life sciences, Mus musculus, neuroscience, transcriptomics
recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.
biology, breast cancer, cancer, computational pathology, histopathology, life sciences
This is a retrospective dataset of 1523 H&E-stained whole slide images (WSI) of lymph nodes from breast cancer patients. The cohort consisted of 177 patients (122 LN-positive - metastasis was reported in at least 1 LN - and 55 LN-negative patients) with invasive breast carcinoma treated between 1984 and 2002 at Guy’s Hospital London, UK. Slides were scanned and digitised at 40x magnification (0.23 µm/pixel), NanoZoomer H.T2.0 2.0-HT (Hamamatsu Photonics UK, Ltd, Welwyn Garden City, UK). WSIs are in .ndpi format.
breast cancer, cancer, computer vision, csv, labeled, life sciences, machine learning, mammography, medical image computing, medical imaging, radiology
According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requi...
cancer, life sciences, magnetic resonance imaging, medical imaging, medicine, radiology
The University of California San Francisco Brain Metastases Stereotactic Radiosurgery (UCSF-BMSR) dataset is a public, clinical, multimodal brain MRI dataset consisting of 560 brain MRIs from 412 patients with expert annotations of 5136 brain metastases. Data consists of registered and skull-stripped T1 post-contrast, T1 pre-contrast, FLAIR and subtraction (T1 pre-contrast - T1 post-contrast) images and voxelwise segmentations of enhancing brain metastases in NIfTI format.
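As a rough illustration of what a voxelwise subtraction image is, the sketch below subtracts two small, already registered 2D slices in plain Python, in the order given in the UCSF-BMSR description (T1 pre-contrast minus T1 post-contrast). The function name and the toy intensity values are hypothetical. Real data would be read from the NIfTI files with an imaging library, and the dataset's own preprocessing may differ from this simple difference.

```python
from typing import List

Slice2D = List[List[float]]


def subtract_slices(pre: Slice2D, post: Slice2D) -> Slice2D:
    """Return the voxelwise difference pre - post for two 2D slices.

    Both slices must have the same shape, which is what registration
    to a common grid provides.
    """
    same_rows = len(pre) == len(post)
    if not same_rows or any(len(a) != len(b) for a, b in zip(pre, post)):
        raise ValueError("slices must have identical shapes")
    return [
        [a - b for a, b in zip(row_pre, row_post)]
        for row_pre, row_post in zip(pre, post)
    ]


if __name__ == "__main__":
    # Toy intensities: only the lower-right voxel enhances after contrast.
    t1_pre = [[10.0, 12.0], [11.0, 13.0]]
    t1_post = [[10.0, 12.0], [11.0, 40.0]]
    print(subtract_slices(t1_pre, t1_post))
```

In this toy case only the enhancing voxel has a non-zero difference, which is the property that makes subtraction images useful for spotting contrast uptake.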