A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.
At most annually
See all datasets managed by Sophos AI.
Sophos/ReversingLabs 20 Million malware detection dataset was accessed on DATE from https://registry.opendata.aws/sorel-20m.
aws s3 ls --no-sign-request s3://sorel-20m/
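The listing command above can be wrapped in a small helper for browsing the bucket by prefix. This is a minimal sketch, not part of the dataset's tooling: the function name `sorel_ls`, the `SOREL_BUCKET` variable, and the `DRY_RUN` switch are hypothetical names introduced here, and the sketch assumes the AWS CLI is installed. The `--human-readable` and `--summarize` options are standard `aws s3 ls` flags that print readable sizes and a total object count.

```shell
# Bucket URI taken from the listing command above.
SOREL_BUCKET="s3://sorel-20m"

# sorel_ls [PREFIX]: anonymously list objects under an optional prefix.
# Hypothetical helper. With DRY_RUN=1 it prints the command instead of
# running it, which is useful for checking what would be executed.
sorel_ls() {
  target="${SOREL_BUCKET}/${1:-}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws s3 ls --no-sign-request --human-readable --summarize ${target}"
  else
    aws s3 ls --no-sign-request --human-readable --summarize "${target}"
  fi
}
```

Calling `sorel_ls` with no argument lists the top level of the bucket. Passing a prefix that appears in that listing (with a trailing slash) narrows the listing to that prefix. Because the malware samples are complete files, downloads are best directed to an isolated environment.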