Biohub, bioinformatics, life sciences, machine learning, metagenomics, protein, structural biology
The ESM Atlas is a large-scale public dataset of computational outputs generated by ESMC and ESMFold2, derived from a deduplicated set of over 6.8 billion publicly available protein sequences spanning all domains of life, including viral proteins and previously unannotated sequences representing metagenomic dark matter sampled from a wide range of biomes. The dataset has two primary components: sparse autoencoder (SAE) features for ~6.8 billion proteins, which capture interpretable biological representations from the ESMC 6B model, and predicted three-dimensional protein structures for ~1.1 billion proteins, generated with ESMFold2. Proteins are organized into 7.7 million clusters based on SAE feature similarity, enabling functional grouping across the protein universe. The dataset is accessible via the AWS CLI. A companion data explorer is available at https://biohub.ai/esmc/atlas.
This is the v1.0 release of the dataset. Updates will be announced via the dataset documentation page.
CC BY-SA 4.0
See all datasets managed by Biohub.
ESM Atlas — Protein Features and Structures was accessed on DATE from https://registry.opendata.aws/biohub-esm-atlas.
ARN: arn:aws:s3:::esm-protein-atlas
Region: us-west-2
AWS CLI: aws s3 ls --no-sign-request s3://esm-protein-atlas/
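For readers who would rather browse the bucket from a script than from the AWS CLI, the following is a minimal sketch of the same anonymous listing done over HTTPS with only the Python standard library. It assumes the bucket allows unsigned listing requests (which the `--no-sign-request` command above implies) and uses the standard S3 ListObjectsV2 REST call. The function names are illustrative, and nothing here assumes a particular directory layout inside the bucket, since the listing does not describe one.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "esm-protein-atlas"  # bucket name taken from the ARN above
REGION = "us-west-2"          # region taken from the listing above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # XML namespace of S3 list responses


def build_list_url(prefix="", delimiter="/", max_keys=1000):
    """Build an unsigned ListObjectsV2 URL for the bucket."""
    query = urllib.parse.urlencode(
        {
            "list-type": "2",
            "prefix": prefix,
            "delimiter": delimiter,
            "max-keys": str(max_keys),
        }
    )
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (sub_prefixes, [(key, size_in_bytes), ...]) from a list response."""
    root = ET.fromstring(xml_data)
    prefixes = [
        el.findtext(f"{S3_NS}Prefix")
        for el in root.findall(f"{S3_NS}CommonPrefixes")
    ]
    objects = [
        (el.findtext(f"{S3_NS}Key"), int(el.findtext(f"{S3_NS}Size")))
        for el in root.findall(f"{S3_NS}Contents")
    ]
    return prefixes, objects


def list_bucket(prefix=""):
    """Fetch and parse the first page of a listing (requires network access)."""
    with urllib.request.urlopen(build_list_url(prefix)) as resp:
        return parse_listing(resp.read())


if __name__ == "__main__":
    sub_prefixes, objects = list_bucket()
    for p in sub_prefixes:
        print("PRE", p)
    for key, size in objects:
        print(size, key)
```

This sketch reads only the first page of results (up to 1000 entries) and does not follow continuation tokens. For bulk transfers, the AWS CLI command shown above is likely the more practical route.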