Description
EMBER is an open-source dataset, published in 2018, built from files in the Windows Portable Executable (PE) format. PE is the file format for Windows executables, object code, and DLLs. A PE file contains all of the information the Windows operating system needs to manage, parse, and execute the code within it. It plays the same role as the Executable and Linkable Format (ELF) on Linux or Mach-O on macOS and iOS.
The EMBER dataset is split into training and test samples as follows:
Split | Malicious | Benign | Unlabeled | Total |
Training Samples | 300k | 300k | 300k | 900k |
Test Samples | 100k | 100k | none | 200k |
Features
The EMBER dataset consists of both raw features and vectorized features. The raw features are extracted directly from the PE files, while the vectorized features are derived from the raw features so that they can be fed to a model. The features fall into two groups: parsed features, which require parsing the PE format, and format-agnostic features, which are computed from the bytes of the file alone.
Parsed Features
General file information includes the file size and details obtained from the PE header: virtual size, number of imported and exported functions, number of symbols, and whether the file has a debug section, thread local storage, resources, relocations, or a signature.
Header information is extracted from the COFF header of the PE file (e.g., timestamp, target machine, list of image characteristics).
Imported functions are parsed to extract the list of functions that the PE file imports.
Exported functions are also parsed out of the PE file and added to the dataset.
Section information is extracted for each section: name, size, entropy, virtual size, and a list of strings describing the section's characteristics.
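To make the header fields above concrete, here is a minimal sketch that reads the machine type, section count, timestamp, and image characteristics from a COFF header using only the Python standard library. It is an illustration under stated assumptions, not the extraction code used to build EMBER: the function names, the returned dictionary keys, and the synthetic stub builder are all hypothetical, and the characteristics table lists only a small subset of the flags defined by the PE format.

```python
import struct

# COFF file header layout (20 bytes, little-endian): Machine, NumberOfSections,
# TimeDateStamp, PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader,
# Characteristics.
COFF_FORMAT = "<HHIIIHH"

# Illustrative subset of image characteristic flags, in ascending bit order.
CHARACTERISTIC_FLAGS = [
    (0x0001, "RELOCS_STRIPPED"),
    (0x0002, "EXECUTABLE_IMAGE"),
    (0x0020, "LARGE_ADDRESS_AWARE"),
    (0x0100, "32BIT_MACHINE"),
    (0x2000, "DLL"),
]


def parse_coff_header(data):
    """Return a dict of COFF header fields (hypothetical helper, not EMBER's API)."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The 4-byte offset of the PE signature is stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    (machine, n_sections, timestamp, _symtab_ptr, n_symbols,
     _opt_header_size, characteristics) = struct.unpack_from(
        COFF_FORMAT, data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "number_of_symbols": n_symbols,
        "characteristics": [
            name for bit, name in CHARACTERISTIC_FLAGS if characteristics & bit
        ],
    }


def make_minimal_pe_stub(machine=0x8664, sections=3,
                         timestamp=1_500_000_000, characteristics=0x0022):
    """Build a synthetic DOS header + PE signature + COFF header for testing.

    This is not a loadable executable; it only carries the fields parsed above.
    """
    dos_header = bytearray(0x40)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 0x40)  # PE signature follows directly
    coff = struct.pack(COFF_FORMAT, machine, sections, timestamp,
                       0, 0, 0xF0, characteristics)
    return bytes(dos_header) + b"PE\x00\x00" + coff


info = parse_coff_header(make_minimal_pe_stub())
print(info)
```

In practice a dedicated PE parsing library would handle malformed headers, the optional header, imports, exports, and sections far more robustly than this hand-rolled reader.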
Format-agnostic features
Byte Histogram: A histogram with one bin for each of the 256 possible byte values, where each bin holds the number of times that value occurs in the binary.
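The byte histogram described above can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the normalization step is an assumption about how such counts are commonly turned into a model input, not something stated in this description.

```python
from collections import Counter


def byte_histogram(data):
    """Return a 256-element list; index v holds the count of byte value v."""
    counts = Counter(data)  # iterating over bytes yields integers in 0..255
    return [counts.get(value, 0) for value in range(256)]


def normalized_histogram(hist):
    """Scale counts so they sum to 1 (assumed preprocessing, for illustration)."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [count / total for count in hist]


sample = b"MZ\x90\x00\x03\x00\x00\x00"
hist = byte_histogram(sample)
print(hist[0x00], hist[0x4D], sum(hist))  # counts for 0x00, 'M', and all bytes
```

Because the result always has 256 entries regardless of file size, files of very different lengths map to vectors of the same shape.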
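Byte-Entropy Histogram: A byte-entropy histogram approximates the joint distribution p(H, X) of entropy H and byte value X. The entropy is computed over a fixed-length window that slides across the file, and each window contributes its pairs of (H, X) values to the histogram.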
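One way to approximate such a joint distribution is sketched below. The window size, step, and 16 x 16 binning are illustrative assumptions chosen for this example rather than values taken from this description, and the function names are hypothetical. Each window's Shannon entropy selects a row, and the byte values seen in that window are added to coarse value bins in that row.

```python
import math


def window_entropy(counts, total):
    """Shannon entropy in bits of a byte-count table (0.0 to 8.0)."""
    entropy = 0.0
    for count in counts:
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def byte_entropy_histogram(data, window=2048, step=1024, bins=16):
    """Return a bins x bins table: rows are entropy bins, columns byte-value bins.

    window, step, and bins are assumed defaults for this sketch.
    """
    hist = [[0] * bins for _ in range(bins)]
    if not data:
        return hist
    last_start = max(len(data) - window, 0)
    for start in range(0, last_start + 1, step):
        block = data[start:start + window]
        counts = [0] * 256
        for byte in block:
            counts[byte] += 1
        entropy = window_entropy(counts, len(block))
        # Map entropy in [0, 8] bits to a row; clamp the maximum into the last row.
        row = min(int(entropy / 8 * bins), bins - 1)
        for value, count in enumerate(counts):
            if count:
                hist[row][value * bins // 256] += count
    return hist


# A constant block has zero entropy, so all of its mass lands in row 0, column 0.
low = byte_entropy_histogram(bytes(4096))
# A block containing every byte value equally often has the maximum entropy.
high = byte_entropy_histogram(bytes(range(256)) * 8)
print(low[0][0], high[15])
```

Packed or encrypted regions tend to fall in the high-entropy rows, while padding and sparse data fall in the low-entropy rows, which is what makes the joint view more informative than the plain byte histogram.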
String Information: Simple statistics about the printable strings in the file. Specifically, the following are reported: the number of strings, their average length, a histogram of the printable characters, and the entropy of the characters across all printable strings.
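The string statistics listed above can be sketched as follows. The definition of a printable string used here (a run of at least five bytes in the range 0x20 to 0x7e) is an assumption made for this example, and the function name and dictionary keys are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed definition: a printable string is a run of 5 or more printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")


def string_stats(data):
    """Return count, average length, character histogram, and character entropy."""
    strings = PRINTABLE_RUN.findall(data)
    char_counts = Counter(b"".join(strings))
    total_chars = sum(char_counts.values())
    entropy = 0.0
    for count in char_counts.values():
        p = count / total_chars
        entropy -= p * math.log2(p)
    return {
        "num_strings": len(strings),
        "avg_length": total_chars / len(strings) if strings else 0.0,
        # One bin per printable character, from 0x20 (space) to 0x7e (~).
        "char_histogram": [char_counts.get(v, 0) for v in range(0x20, 0x7F)],
        "entropy": entropy,
    }


blob = b"\x00\x01hello\x00abc\x00world!!\xff"
stats = string_stats(blob)
print(stats["num_strings"], stats["avg_length"])
```

In this sample, "abc" is ignored because it is shorter than the assumed minimum length, so only "hello" and "world!!" contribute to the statistics.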
Advantages
This dataset includes both benign and malicious samples, whereas prior datasets included only malicious samples. This matters because a classifier trained only on malicious samples has no benign examples to learn from, which makes training exceedingly difficult and prone to a high false positive rate.
Disadvantages
The EMBER dataset is a features-only dataset: it does not include the raw binaries, which limits the extraction of new features and limits experiments with featureless deep-learning algorithms.
The original paper describing this data set can be found at: https://arxiv.org/abs/1804.04637