ESM Atlas — Protein Features and Structures

Biohub, bioinformatics, life sciences, machine learning, metagenomics, protein, structural biology

Description

The ESM Atlas is a large-scale public dataset of computational outputs generated by ESMC and ESMFold2, derived from a deduplicated set of over 6.8 billion publicly available protein sequences spanning all domains of life, including viral proteins and previously unannotated sequences representing metagenomic dark matter sampled from a wide range of biomes. The dataset has two primary components: sparse autoencoder (SAE) features for ~6.8 billion proteins, which capture interpretable biological representations from the ESMC 6B model, and predicted three-dimensional protein structures for ~1.1 billion proteins, generated using ESMFold2. Proteins are organized into 7.7 million clusters based on SAE feature similarity, enabling functional grouping across the protein universe. The dataset is accessible via the AWS CLI. A companion data explorer is available at https://biohub.ai/esmc/atlas.

Update Frequency

This is the v1.0 release of the dataset. Updates will be announced via the dataset documentation page.

License

CC BY-SA 4.0

Documentation

https://biohub.ai/esmc/atlas

Managed By

Biohub


Contact

support@biohub.org

How to Cite

ESM Atlas — Protein Features and Structures was accessed on DATE from https://registry.opendata.aws/biohub-esm-atlas.


Resources on AWS

  • Description: ESM Atlas — Protein sequences, SAE features, and predicted 3D structures
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::esm-protein-atlas
    AWS Region: us-west-2
    AWS CLI Access (No AWS account required):
    aws s3 ls --no-sign-request s3://esm-protein-atlas/
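
For scripted access without the AWS CLI, the same anonymous listing can be sketched in Python using only the standard library. This is a minimal sketch, assuming the bucket accepts unsigned S3 ListObjectsV2 requests over HTTPS, as the --no-sign-request command above suggests. The helper names build_list_url, parse_listing, and list_bucket are illustrative and not part of any Biohub tooling. The bucket layout is not described on this page, so the example lists only the top level and assumes no particular prefixes.

```python
from __future__ import annotations

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "esm-protein-atlas"  # bucket name taken from the ARN above
REGION = "us-west-2"          # region listed above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # namespace of S3 list responses


def build_list_url(prefix: str = "", max_keys: int = 50) -> str:
    """Build an unsigned ListObjectsV2 URL for one 'directory' level of the bucket."""
    query = urllib.parse.urlencode(
        {
            "list-type": "2",   # selects the ListObjectsV2 API
            "delimiter": "/",   # group deeper keys into common prefixes
            "prefix": prefix,
            "max-keys": str(max_keys),
        }
    )
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_data: bytes | str) -> tuple[list[str], list[str]]:
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    prefixes = [
        el.text or ""
        for el in root.findall(f"{S3_NS}CommonPrefixes/{S3_NS}Prefix")
    ]
    keys = [el.text or "" for el in root.findall(f"{S3_NS}Contents/{S3_NS}Key")]
    return prefixes, keys


def list_bucket(prefix: str = "") -> tuple[list[str], list[str]]:
    """Fetch and parse one page of the listing (assumes anonymous access is allowed)."""
    with urllib.request.urlopen(build_list_url(prefix), timeout=30) as resp:
        return parse_listing(resp.read())


# Example (requires network access):
#   prefixes, keys = list_bucket()
#   for p in prefixes:
#       print("PRE", p)
#   for k in keys:
#       print(k)
```

Only the first page of results is returned. A fuller client would follow the NextContinuationToken element in the response to page through larger listings.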
