Registry of Open Data on AWS

Steinegger Lab Datasets

bioinformaticslife sciencesmetagenomicsopen source softwareproteinprotein folding

The Steinegger Lab Dataset comprises biological databases and resources critical for protein sequence and structure analysis, developed to support ColabFold, MMseqs2, and Foldseek/Foldcomp—three high-performance computational tools widely used in bioinformatics.The MMseqs2 dataset serves as the backbone for our fast structure prediction tool, ColabFold, and includes UniRef30, BFD, and the ColabFold environmental databases. These datasets are specifically designed for the rapid generation of multiple sequence alignments (MSAs), which are essential for high-accuracy structure prediction. Beyond ...

Details →

Usage examples

Fast and accurate protein structure search with Foldseek by van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al.
Run ColabFold on your local computer by Moriwaki Y
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets by Steinegger M and Söding J
ColabFold User Guide by Mirdita M and Ovchinnikov S
ColabFold Tutorial by Ovchinnikov S, Mirdita M and Steinegger M

See 9 usage examples →

Logan Unitigs and Contigs of the Sequence Read Archive (SRA) on AWS

fastageneticgenomiclife sciencesmetagenomicsSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing

This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve near...

Details →

Usage examples

See 8 usage examples →

BUSCO Datasets

assemblybacteriabioinformaticsgenomiclife sciencesmetagenomicsopen source softwareproteinvirus

Lineage datasets for use with BUSCO software package. Each dataset contains HMM profiles for clade specific, universal, single-copy marker genes. Datasets are available across archaea, bacteria, eukaryota and virus domains. The repository also includes necessary data files for phylogenetic placement of an input assembly.

Details →

Usage examples

BUSCO Update - Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. by Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, Evgeny M Zdobnov
BUSCO - from QC to gene prediction and phylogenomics by Matthew Berkeley
OrthoDB and BUSCO update - annotation of orthologs with wider sampling of genomes. by Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva
BUSCO - assessing genomic data quality and beyond. by Mosè Manni, Matthew R. Berkeley, Mathieu Seppey, Evgeny M. Zdobnov

See 4 usage examples →

Kraken2 NCBI RefSeq Complete V205 database on AWS

benchmarkbioinformaticslife sciencesmetagenomicsmicrobiome

Database for use with Kraken2 (taxonomic annotation of metagenomic sequencing reads) including all NCBI RefSeq genomes available in release V205

Details →

Usage examples

From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools by Robyn J. Wright, Andre M. Comeau and Morgan G.I. Langille
Using an Amazon Machine Image for analysing samples with Kraken2 by Robyn Wright
Kraken2 by Derrick Wood, Jennifer Lu and Ben Langmead

See 3 usage examples →

QIIME 2 Tutorial Data

bioinformaticsbiologyecosystemsenvironmentalgeneticgenomichealthlife sciencesmetagenomicsmicrobiome

QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

Details →

Usage examples

See 3 usage examples →

The Human Microbiome Project

amino acidfastafastqgeneticgenomiclife sciencesmetagenomicsmicrobiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital track, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Details →

Usage examples

New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K.Crossman, Richard Stahl, Peter Eipers, et al
Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.
The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight & Jeffrey I. Gordon

See 3 usage examples →

Hecatomb Databases

bioinformaticsgeneticgenomiclife sciencesmetagenomicsviruswhole genome sequencing

Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.

Details →

Usage examples

See 2 usage examples →

Indexes for Kaiju

bioinformaticsbiologygenomiclife sciencesmetagenomicsmicrobiomereference indexwhole genome sequencing

This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.

Details →

Usage examples

Quickstart Tutorial for downloading the index files and running Kaiju. by Peter Menzel
Fast and sensitive taxonomic classification for metagenomics with Kaiju by Peter Menzel et al (2016)

See 2 usage examples →

MetaGraph Sequence Indexes

analysis ready databiodiversitybioinformaticsbiologyfastagenomegenomicgraphinformation retrievallife sciencesmedicinemetagenomicsmicrobiometranscriptomicswhole exome sequencingwhole genome sequencing

The MetaGraph Sequence Indexes dataset comprises full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA).All index files can be used with the MetaGraph framework for sequence search. Indexes can be jointly used for aggregated search in the cloud or can be individually downloaded...

Details →