Steinegger Lab Datasets

bioinformatics, life sciences, metagenomics, open source software, protein, protein folding

Description

The Steinegger Lab Dataset comprises biological databases and resources for protein sequence and structure analysis. It was developed to support ColabFold, MMseqs2, and Foldseek/Foldcomp, three high-performance computational tools widely used in bioinformatics.

The MMseqs2 dataset serves as the backbone for our fast structure prediction tool, ColabFold, and includes UniRef30, BFD, and the ColabFold environmental databases. These databases are designed for the rapid generation of multiple sequence alignments (MSAs), which are essential for high-accuracy structure prediction. Beyond MSA generation, they also support fast taxonomic and functional annotation, and therefore a wide range of bioinformatics applications.

The Foldseek dataset includes preprocessed databases such as the AlphaFold Database (AFDB), PDB, SwissProt, and CATH, designed for protein structure similarity searches. These databases cover the majority of both experimental and predicted structural resources and support analyses of monomers and multimers alike.

Update Frequency

Occasionally, when new data is available

License

CC BY 4.0

Documentation

For the MMseqs2/ColabFold dataset, please see https://colabfold.mmseqs.com.
For the Foldseek dataset, please see https://search.foldseek.com.

Managed By

Steinegger Lab, Seoul National University

Contact

martin.steinegger@snu.ac.kr

How to Cite

Steinegger Lab Datasets was accessed on DATE from https://registry.opendata.aws/steineggerlab.

If you're using MMseqs2, please cite: "Steinegger M and Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology (2017), doi: 10.1038/nbt.3988"

If you're using Foldseek, please cite: "van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023), doi: 10.1038/s41587-023-01773-0"

If you're using ColabFold, please cite: "Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, and Steinegger M. ColabFold: Making protein folding accessible to all. Nature Methods (2022), doi: 10.1038/s41592-022-01488-1"

Resources on AWS

  • Description: Steinegger Lab Datasets
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::steineggerlab
    AWS Region: us-east-1
    AWS CLI Access (No AWS account required): aws s3 ls --no-sign-request s3://steineggerlab/
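
The unsigned listing done by the AWS CLI command above can also be sketched in Python using only the standard library. This is a minimal sketch, not an official client. It assumes the bucket accepts anonymous ListObjectsV2 requests at the usual virtual-hosted S3 endpoint for us-east-1, and the helper names (`build_list_url`, `parse_listing`, `list_bucket`) are illustrative only. It reads just the first page of results (up to 1,000 keys) and does not follow continuation tokens.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "steineggerlab"   # from the ARN listed above
REGION = "us-east-1"       # from the AWS Region listed above
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(bucket=BUCKET, region=REGION, prefix="", delimiter="/"):
    """Build an unsigned ListObjectsV2 URL (assumed virtual-hosted endpoint)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (prefixes, objects) from a ListObjectsV2 XML response.

    prefixes: list of "directory"-like common prefixes
    objects:  list of (key, size_in_bytes) tuples
    """
    root = ET.fromstring(xml_data)
    prefixes = [
        el.findtext(f"{S3_NS}Prefix")
        for el in root.findall(f"{S3_NS}CommonPrefixes")
    ]
    objects = [
        (el.findtext(f"{S3_NS}Key"), int(el.findtext(f"{S3_NS}Size")))
        for el in root.findall(f"{S3_NS}Contents")
    ]
    return prefixes, objects


def list_bucket(prefix=""):
    """Fetch and parse the first page of the listing (network access needed)."""
    with urllib.request.urlopen(build_list_url(prefix=prefix)) as resp:
        return parse_listing(resp.read())


if __name__ == "__main__":
    top_prefixes, top_objects = list_bucket()
    for p in top_prefixes:
        print("PRE", p)
    for key, size in top_objects:
        print(size, key)
```

Passing one of the printed prefixes back in, as in `list_bucket(prefix=...)`, lists one level deeper, much like appending a path to the `aws s3 ls` command.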
