1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, and 4.2

bam bioinformatics biology cram genetic genomic genotyping life sciences machine learning population genetics short read sequencing structural variation tertiary analysis variant annotation whole genome sequencing

Description

This dataset contains alignment files and short nucleotide, copy number (CNV), repeat expansion (STR), structural variant (SV) and other variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, and v4.2.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information).

The v3.7.6 and v4.2.7 datasets include results from trio small variant, de novo structural variant and de novo copy number variant calls on 602 trio families composed of members from the 1000 Genomes Project Phase 3 dataset. Trio repeat expansion calling was included in the v3.7.6 dataset only. Joint cohort analysis was also performed on the entire 1KGP sample dataset (n=3202) for the v3.7.6, v4.0.3, and v4.2.7 reanalyses using DRAGEN GVCF Genotyper v3.8.3, v4.2.0, and v4.2.7, respectively (see 'Genotyping variants at population scale using DRAGEN gVCF Genotyper').

Improvements and new features in the v3.7.6 individual sample analyses include CYP2D6 variant calling and joint detection (see 'DRAGEN 3.7 User Guide' for details on these features) and use of graph-based hg19 and hg38 reference hash tables (see 'DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes' and 'Demystifying the versions of GRCh38/hg38 reference genomes, how they are used in DRAGEN and their impact on accuracy' for details).

The DRAGEN v4.0.3 dataset features improved small variant calling accuracy from newly integrated machine learning functionality, used with an updated graph-based reference for difficult-to-map regions (see 'DRAGEN Sets New Standard for Data Accuracy in PrecisionFDA Benchmark Data. Optimizing Variant Calling Performance with Illumina Machine Learning and DRAGEN Graph'); accuracy and runtime improvements in the SV caller; new targeted callers including CYP2B6, GBA, SMN and a Star Allele PGx caller; and an expanded catalog for use with the Expansion Hunter STR caller (see 'DRAGEN v4.0.3 Software Release Notes' for details on these and other new features and improvements).

DRAGEN v4.2 offers significant accuracy improvements in small variant, CNV, and SV calling, includes new targeted callers (HBA, LPA, RH, CYP21A2, SMN silent carrier variant), and supports star allele calling for 5 additional pharmacogenes (BCHE, ABCG2, NAT2, F5, UGT2B17). See 'DRAGEN v4.2.4 Software Release Notes' for details about these and other new features and improvements for v4.2.

Starting with the v4.0.3 reanalysis, annotation using the Illumina Annotation Engine (Nirvana) was included as part of the analysis (see 'Nirvana Documentation' for more information). In both the v4.0.3 and v4.2.7 datasets, annotation was performed on the merged small variant VCF generated by the DRAGEN GVCF Genotyper for the entire 1KGP cohort. For v4.2.7, annotation was also performed on the CNV, SV, and STR VCFs for the entire cohort. For each of these variant types, the individual VCFs generated by the DRAGEN CNV, SV, or STR caller for all samples in the cohort were merged into one VCF before annotation.

Update Frequency

Files may be updated following changes to the 1000 Genomes Project dataset or the release of select new DRAGEN features or offerings.

License

TBD

Documentation

DRAGEN Support Resources

Managed By

Illumina, Inc.

See all datasets managed by Illumina, Inc.

Contact

Illumina, Inc.

How to Cite

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, and 4.2 was accessed on DATE from https://registry.opendata.aws/ilmn-dragen-1kgp.

Resources on AWS

  • Description
    BAM, SNV-vcf, SNV-gvcf, STR-vcf, STR-bam, SV-vcf, ROH-vcf, CNV-vcf, CNV-bw, metrics and other supporting files from DRAGEN v3.5.6b analyses in a public S3 bucket.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::1000genomes-dragen
    AWS Region
    us-west-2
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://1000genomes-dragen/
  • Description
    BAM, SNV-vcf, SNV-gvcf, STR-vcf, STR-bam, SV-vcf, ROH-vcf, CNV-vcf, CNV-bw, cyp2d6-tsv, metrics and other supporting files from DRAGEN v3.7.6 analyses in a public S3 bucket.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::1000genomes-dragen-3.7.6
    AWS Region
    us-west-2
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://1000genomes-dragen-3.7.6/
  • Description
    BAM, SNV-vcf, SNV-gvcf, STR-vcf, STR-bam, SV-vcf, ROH-vcf, CNV-vcf, CNV-bw, cyp2d6-tsv, metrics and other supporting files from DRAGEN v3.7.6 analyses in a public S3 bucket. This is a clone of the 1000genomes-dragen-3.7.6 bucket in the us-east-1 region.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::1000genomes-dragen-v3.7.6
    AWS Region
    us-east-1
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://1000genomes-dragen-v3.7.6/
  • Description
    CRAM, SNV-vcf, SNV-gvcf, STR-vcf, STR-bam, SV-vcf, ROH-vcf, CNV-vcf, CNV-bw, cyp2b6-tsv, cyp2d6-tsv, gba-tsv, smn-tsv, star-allele-tsv, metrics and other supporting files from DRAGEN v4.0.3 analyses and Nirvana Annotation in a public S3 bucket.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::1000genomes-dragen-v4.0.3
    AWS Region
    us-east-1
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://1000genomes-dragen-v4.0.3/
  • Description
    CRAM, SNV-vcf, SNV-gvcf, STR-vcf, STR-bam, SV-vcf, ROH-vcf, CNV-vcf, CNV-bw, cyp2b6-tsv, cyp2d6-tsv, gba-tsv, smn-tsv, star-allele-tsv, hla-tsv, gvcf, json, metrics and other supporting files from DRAGEN v4.2.7 analyses and Nirvana Annotation in a public S3 bucket.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::1000genomes-dragen-v4-2-7
    AWS Region
    us-east-1
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://1000genomes-dragen-v4-2-7/
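Because the buckets above sit in two different regions and follow slightly different naming patterns, a small lookup helper can reduce copy-and-paste mistakes. The sketch below is only an illustration: the function names and the version keys (`3.5`, `3.7.6-west`, `3.7.6-east`, `4.0.3`, `4.2.7`) are hypothetical, while the bucket names and regions are copied from the resource list. It prints the listing command rather than running it, and it assumes the AWS CLI global options `--no-sign-request` and `--region`.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any AWS or Illumina tool).
# Bucket names and regions are taken from the resource list above.

# Print the bucket name for a given reanalysis key.
dragen_bucket() {
  case "$1" in
    3.5)        echo "1000genomes-dragen" ;;
    3.7.6-west) echo "1000genomes-dragen-3.7.6" ;;
    3.7.6-east) echo "1000genomes-dragen-v3.7.6" ;;
    4.0.3)      echo "1000genomes-dragen-v4.0.3" ;;
    4.2.7)      echo "1000genomes-dragen-v4-2-7" ;;
    *)          echo "unknown version: $1" >&2; return 1 ;;
  esac
}

# Print the AWS Region that hosts the bucket for a given key.
dragen_region() {
  case "$1" in
    3.5|3.7.6-west)         echo "us-west-2" ;;
    3.7.6-east|4.0.3|4.2.7) echo "us-east-1" ;;
    *)                      echo "unknown version: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the anonymous listing command for a given key.
dragen_ls_command() {
  bucket=$(dragen_bucket "$1") || return 1
  region=$(dragen_region "$1") || return 1
  echo "aws s3 ls --no-sign-request --region $region s3://$bucket/"
}
```

After sourcing these functions, `dragen_ls_command 4.2.7` prints the command for the v4.2.7 bucket, and `sh -c "$(dragen_ls_command 4.2.7)"` runs it, provided the AWS CLI is installed. Printing the command first makes it easy to confirm the bucket and region before any network request is made.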
