bam bioinformatics coronavirus COVID-19 fasta fastq genetic genomic global health life sciences long read sequencing SARS-CoV-2 vcf virus whole genome sequencing
The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. Because a standardized workflow was used to produce this harmonized dataset, public data generated using different methodologies can be combined and compared for a more powerful global analysis of available SARS-CoV-2 data, and researchers gain rapid access to aggregated downstream results.

Methodology: Reads from the SRA were extracted in FASTQ format, then processed by a pipeline chosen according to the sequencing technology that produced them: the ARTIC protocol for Oxford Nanopore-derived reads; the SIGNAL pipeline for paired-end Illumina reads; and the CoSA pipeline (using DeepVariant for variant calling) for PacBio reads. Briefly, reads were primer-trimmed and aligned to the SARS-CoV-2 reference genome, after which contiguous regions were assembled and variant sites were called. Pangolin was then used to assign viral lineage based on the assembled genome.
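The technology-dependent routing described in the methodology can be pictured as a simple lookup. The sketch below is illustrative only and is not DNAstack's implementation: the function name `pipeline_for_platform`, the platform labels (`OXFORD_NANOPORE`, `ILLUMINA`, `PACBIO_SMRT`) and the `PAIRED` layout label are assumptions chosen to resemble SRA run metadata.

```shell
# Map a sequencing platform (and, for Illumina, a library layout) to the
# pipeline named in the methodology. All labels here are assumptions.
pipeline_for_platform() {
  case "$1" in
    OXFORD_NANOPORE)
      echo "ARTIC" ;;
    ILLUMINA)
      # The description names SIGNAL only for paired-end Illumina reads.
      if [ "${2:-}" = "PAIRED" ]; then
        echo "SIGNAL"
      else
        echo "unsupported"
        return 1
      fi ;;
    PACBIO_SMRT)
      echo "CoSA" ;;
    *)
      echo "unsupported"
      return 1 ;;
  esac
}
```

For example, `pipeline_for_platform ILLUMINA PAIRED` prints `SIGNAL`, while a platform not covered by the description prints `unsupported` and returns a non-zero status.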
Update frequency: Rolling
https://github.com/DNAstack/dnastack-open-data
Managed by DNAstack.
DNAStack COVID19 SRA Data was accessed on DATE from https://registry.opendata.aws/dnastack-covid-19-sra-data.
arn:aws:s3:::dnastack-covid-19-sra-data
us-west-2
aws s3 ls --no-sign-request s3://dnastack-covid-19-sra-data/
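The listing command above can be wrapped in small helpers for repeated use. This is a minimal sketch that assumes the AWS CLI is installed. The bucket name and region come from this listing, but the function names and any prefix or key passed to them are hypothetical, since the bucket layout is not described here.

```shell
# Bucket and region as given in this listing.
BUCKET="dnastack-covid-19-sra-data"
REGION="us-west-2"

# List objects anonymously under an optional key prefix.
# $1: key prefix (hypothetical; omit it to list the bucket root)
list_bucket() {
  aws s3 ls --no-sign-request --region "$REGION" "s3://${BUCKET}/${1:-}"
}

# Copy one object anonymously to a local path.
# $1: object key (hypothetical), $2: local destination (defaults to ".")
fetch_object() {
  aws s3 cp --no-sign-request --region "$REGION" "s3://${BUCKET}/$1" "${2:-.}"
}
```

Calling `list_bucket` with no argument runs the same request as the command above. The `--no-sign-request` flag sends the request without credentials, so no AWS account is needed for public data.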