Sophos/ReversingLabs 20 Million malware detection dataset

cyber security, deep learning, labeled, machine learning

Description

A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.

Update Frequency

At most annually

License

See Terms of Use at "https://github.com/sophos-ai/SOREL-20M/Terms and Conditions of Use.pdf"

Documentation

https://github.com/sophos-ai/SOREL-20M/README.md

Managed By

Sophos AI

Contact

sorel-dataset@sophos.com

Resources on AWS

  • Description: Sophos/ReversingLabs 20 million sample dataset
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::sorel-20m/
    AWS Region: us-west-2
    AWS CLI Access (No AWS account required): aws s3 ls s3://sorel-20m/ --no-sign-request
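
For readers without the AWS CLI installed, the same unauthenticated listing can be sketched in Python using only the standard library. This is a minimal sketch, not an official access method: it assumes the bucket answers anonymous requests to the standard S3 ListObjectsV2 REST endpoint in the same way it answers the CLI command above, and the helper names (build_list_url, parse_listing, list_bucket) are hypothetical. The bucket name and region are taken from the entry above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "sorel-20m"   # from the ARN listed above
REGION = "us-west-2"   # from the AWS Region listed above
# XML namespace used by S3 list responses
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(prefix="", max_keys=20):
    """Build an unsigned ListObjectsV2 URL for one 'directory' level."""
    query = urllib.parse.urlencode({
        "list-type": "2",      # selects the ListObjectsV2 API
        "delimiter": "/",      # group keys by their next path component
        "prefix": prefix,
        "max-keys": str(max_keys),
    })
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (common_prefixes, object_keys) from a list response body."""
    root = ET.fromstring(xml_text)
    prefixes = [e.findtext(f"{S3_NS}Prefix")
                for e in root.findall(f"{S3_NS}CommonPrefixes")]
    keys = [e.findtext(f"{S3_NS}Key")
            for e in root.findall(f"{S3_NS}Contents")]
    return prefixes, keys


def list_bucket(prefix="", max_keys=20):
    """Fetch and parse one page of the listing (requires network access).

    Assumption: anonymous REST listing is permitted on this bucket.
    """
    url = build_list_url(prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling list_bucket() with no arguments would request the top level of the bucket, roughly what the CLI command shows. Only the first page of results is fetched; following continuation tokens for longer listings is left out of this sketch.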
