A centralized sequence repository for all records containing sequences associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Included are both the original sequences submitted by the principal investigator and SRA-processed sequences that require the SRA Toolkit for analysis. Additionally, submitter-provided metadata included in associated BioSample and BioProject records is available alongside NCBI-calculated data, such as k-mer-based taxonomy analysis results, contiguous assemblies (contigs) and associated statistics such as contig length, BLAST results for the assembled contigs, contig annotation, BLAST databases of contigs and their annotated peptides, and VCF files generated for each record relative to the SARS-CoV-2 RefSeq record. Finally, metadata is also available in Parquet format to facilitate search and filtering using the Amazon Athena service.
See all datasets managed by the National Library of Medicine (NLM).
COVID-19 Genome Sequence Dataset was accessed on DATE from https://registry.opendata.aws/ncbi-covid-19.
Files in the sra-src folder are in FASTQ, BAM, or CRAM format (original submission); files in the run folder are in .sra format and require the SRA Toolkit.
aws s3 ls --no-sign-request s3://sra-pub-sars-cov2/
aws s3 ls --no-sign-request s3://sra-pub-sars-cov2-metadata-us-east-1/
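The folder descriptions and listing commands above can be combined into a small helper. The following is a minimal sketch, not an official NCBI or AWS script: it assumes the two top-level folders in the sequence bucket are named `sra-src` and `run`, and the function names `describe_folder` and `list_command` are hypothetical. It prints the listing command instead of running it, so the command can be checked before any request is made.

```shell
#!/usr/bin/env bash
# Sketch only: the folder names "sra-src" and "run" are assumptions taken from
# the description above; describe_folder and list_command are hypothetical helpers.

BUCKET="s3://sra-pub-sars-cov2"

# Report which file formats to expect in a given top-level folder.
describe_folder() {
  case "$1" in
    sra-src) echo "original submission: FASTQ, BAM, or CRAM" ;;
    run)     echo "SRA format (.sra): requires the SRA Toolkit" ;;
    *)       echo "unknown folder: $1" >&2; return 1 ;;
  esac
}

# Print (do not execute) the anonymous listing command for a folder.
list_command() {
  describe_folder "$1" >/dev/null || return 1
  echo "aws s3 ls --no-sign-request ${BUCKET}/$1/"
}

# Show the expected formats and the command for each assumed folder.
for folder in sra-src run; do
  echo "${folder}: $(describe_folder "${folder}")"
  list_command "${folder}"
done
```

To actually list a folder, run the printed command directly, or execute it with `$(list_command run)` once the folder name has been confirmed against the output of the top-level listing.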