Sophos/ReversingLabs 20 Million malware detection dataset

cyber security deep learning labeled machine learning


This dataset is intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.

Update Frequency

At most annually


See the Terms of Use


Managed By

Sophos AI

See all datasets managed by Sophos AI.


How to Cite

Sophos/ReversingLabs 20 Million malware detection dataset was accessed on DATE from

Usage Examples

Tools & Applications

Resources on AWS

  • Description
    Sophos/ReversingLabs 20 million sample dataset
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    AWS Region
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://sorel-20m/
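
For readers who would rather browse the bucket from Python than from the AWS CLI, the following is a minimal sketch of the same unsigned listing using only the standard library. It assumes the bucket also permits anonymous listing through the S3 REST API (ListObjectsV2) at the global virtual-hosted endpoint, which the `--no-sign-request` command suggests but this page does not state. The function names `build_list_url`, `parse_listing`, and `list_bucket` are hypothetical helpers introduced for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: global virtual-hosted endpoint for the bucket named in the CLI command.
BUCKET_URL = "https://sorel-20m.s3.amazonaws.com/"
# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(prefix="", delimiter="/"):
    """Build an unsigned ListObjectsV2 request URL for the bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_data):
    """Return (prefixes, objects) from a ListObjectsV2 XML response.

    prefixes: "directory"-like common prefixes under the requested prefix.
    objects:  (key, size_in_bytes) tuples for objects at this level.
    """
    root = ET.fromstring(xml_data)
    prefixes = [
        el.findtext("s3:Prefix", namespaces=S3_NS)
        for el in root.findall("s3:CommonPrefixes", S3_NS)
    ]
    objects = [
        (
            el.findtext("s3:Key", namespaces=S3_NS),
            int(el.findtext("s3:Size", namespaces=S3_NS)),
        )
        for el in root.findall("s3:Contents", S3_NS)
    ]
    return prefixes, objects


def list_bucket(prefix=""):
    """Fetch and parse one page of the listing (requires network access)."""
    with urllib.request.urlopen(build_list_url(prefix), timeout=30) as resp:
        return parse_listing(resp.read())


# Example call (not executed here because it needs network access):
# prefixes, objects = list_bucket()
```

This sketch reads a single response page only; S3 returns at most 1,000 keys per request, so a complete listing of a large prefix would also need to follow the continuation token in the response.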
