The Registry of Open Data on AWS is now available on AWS Data Exchange
All datasets on the Registry of Open Data are now discoverable on AWS Data Exchange alongside 3,000+ existing data products from category-leading data providers across industries. Explore the catalog to find open, free, and commercial datasets.

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with metagenomics.


Search datasets (currently 13 matching datasets)

You are currently viewing a subset of data tagged with metagenomics.


Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.


Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

Steinegger Lab Datasets

bioinformatics, life sciences, metagenomics, open source software, protein, protein folding

The Steinegger Lab Dataset comprises biological databases and resources critical for protein sequence and structure analysis, developed to support ColabFold, MMseqs2, and Foldseek/Foldcomp, three high-performance computational tools widely used in bioinformatics. The MMseqs2 dataset serves as the backbone for our fast structure prediction tool, ColabFold, and includes UniRef30, BFD, and the ColabFold environmental databases. These datasets are specifically designed for the rapid generation of multiple sequence alignments (MSAs), which are essential for high-accuracy structure prediction. Beyond ...

Usage examples

See 9 usage examples →

Logan Unitigs and Contigs of the Sequence Read Archive (SRA) on AWS

fasta, genetic, genomic, life sciences, metagenomics, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve near...

Usage examples

See 7 usage examples →

Kraken2 NCBI RefSeq Complete V205 database on AWS

benchmark, bioinformatics, life sciences, metagenomics, microbiome

Database for use with Kraken2 (taxonomic annotation of metagenomic sequencing reads), including all NCBI RefSeq genomes available in release V205.

Usage examples

See 3 usage examples →

QIIME 2 Tutorial Data

bioinformatics, biology, ecosystems, environmental, genetic, genomic, health, life sciences, metagenomics, microbiome

QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

Usage examples

See 3 usage examples →

The Human Microbiome Project

amino acid, fasta, fastq, genetic, genomic, life sciences, metagenomics, microbiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Usage examples

See 3 usage examples →

Indexes for Kaiju

bioinformatics, biology, genomic, life sciences, metagenomics, microbiome, reference index, whole genome sequencing

This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.

Usage examples

See 2 usage examples →

SocialGene RefSeq Databases

amino acid, bioinformatics, chemical biology, genomic, graph, metagenomics, microbiome, pharmaceutical, protein

Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.

Usage examples

See 3 usage examples →