This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry tagged with fastq.
You are currently viewing a subset of data tagged with fastq.
If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.
Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.
If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.
bambioinformaticsfast5fastafastqgenomiclife scienceslong read sequencingshort read sequencingtranscriptomics
The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...
bambioinformaticsbiologycoronavirusCOVID-19fast5fastafastqgeneticgenomichealthjsonlife scienceslong read sequencingmedicineMERSmetadataopen source softwareRDFSARSSARS-CoV-2SPARQL
COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.
bambioinformaticsbiologyCaenorhabditis elegansfastqgatk-svgenetic mapsgenomegenome wide association studygenomiclife sciencesshort read sequencingvariant annotationvcf
The Caenorhabditis Natural Diversity Resource (CaeNDR) is a data repository and analysis hub of wild strains of selfing Caenhorabditis species C. elegans, C. briggsae, and C. tropicalis from around the world to facilitate discovery of genetic variation across all three species through genome-wide association mappings to correlate genotype with phenotype and identify genetic variation underlying quantitative traits.
bamcramfastqgeneticgenomiclife sciencesSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing
The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-rel...
amino acidfastafastqgeneticgenomiclife sciencesmetagenomicsmicrobiome
The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital track, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...
fastqgeneticgenomiclife scienceswhole genome sequencing
The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
bambioinformaticscoronavirusCOVID-19fastafastqgeneticgenomicglobalhealthlife scienceslong read sequencingSARS-CoV-2vcfviruswhole genome sequencing
The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...
bambioinformaticsbiologycoronavirusCOVID-19cramfastqgeneticgenomichealthlife sciencesMERSSARSSTRIDEStranscriptomicsviruswhole genome sequencing
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six ho...
cramfast5fastqgeneticgenomiclife sciences
This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.
bioinformaticsbiologyfast5fastqgenomicHomo sapienslife scienceswhole genome sequencing
The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support, 1) Exploration of the characteristics of nanopore sequence data. 2) Assessment and reproduction of performance benchmarks 3) Development of tools and methods. The data deposited showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly-available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.
fast5fastqgeneticgenomiclife scienceswhole genome sequencing
The 1KG-ONT-VIENNA panel comprises medium coverage ONT sequencing data for 1.019 samples from the 1000 Genomes Project collection, structural variants, and their haplotype context.
amazon.sciencebioinformaticsfastqgeneticgenomiclife scienceslong read sequencingshort read sequencingwhole exome sequencingwhole genome sequencing
To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets on different sequencing instruments, sequencing modalities (Illumina short read and Pacific BioSciences long read), sample preparation protocols, and for whole genome and whole exome capture. The original source of these data are gs://google-brain-genomics-public.