Improving Malware Classifiers with Plausible Novel Samples

The recent growth and proliferation of malware have tested practitioners’ ability to promptly classify new samples. Compared with labor-intensive reverse engineering, machine learning approaches have demonstrated increased speed and accuracy. However, most existing malware classifiers must be calibrated using a large number of samples that are painstakingly analyzed by hand before training. Furthermore, when new malware samples arise that fall outside the scope of the training set, additional reverse engineering effort is needed to update it, and the sheer volume of new samples found in the wild makes it a significant burden to label enough malware to adequately train modern classifiers.

To address this problem, we propose a three-pronged approach. First, we will leverage data mixing techniques to generate novel and plausibly realistic malware samples by mixing the feature representations of pairs of malware samples. In contrast to rudimentary perturbation techniques, our approach will use semantics-aware augmentation so that the generated samples reflect plausible malicious binaries. Second, we will leverage neural network verification techniques to analyze and improve classification robustness, ensuring a specific level of coverage in the input and feature spaces of malware binaries. This will enable improved classification boundaries in the feature space, resulting in more accurate malware classification. Third, we will develop a search-based malware evolution engine that generates additional novel malicious binaries to incorporate during training. Existing data augmentation techniques work in the feature space or on an abstract representation, so the augmented samples do not necessarily correspond to real, functioning binaries. Thus, we will leverage automated program repair techniques to generate new malware samples, guiding an evolutionary search to evade classification for the purpose of improving malware training data.
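As a rough illustration of the first prong, the sketch below mixes the feature vectors of randomly chosen pairs of samples by convex combination, in the style of mixup. It is a minimal sketch under stated assumptions, not the project's method: the function names `mix_features` and `augment`, the Beta-distributed mixing weight, and the `alpha` default are all hypothetical, and the semantics-aware constraints described above are not modeled here.

```python
import random
from typing import List, Sequence


def mix_features(a: Sequence[float], b: Sequence[float], lam: float) -> List[float]:
    """Return the convex combination lam * a + (1 - lam) * b of two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment(
    samples: Sequence[Sequence[float]],
    n_new: int,
    alpha: float = 0.4,  # hypothetical default; controls how far mixes stray from the originals
    seed: int = 0,
) -> List[List[float]]:
    """Generate n_new synthetic feature vectors by mixing random pairs of samples.

    Assumption: all samples passed in share one label (for example, one malware
    family), so the mixed vector can reuse that label.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to form a pair")
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_new):
        a, b = rng.sample(list(samples), 2)  # two distinct samples
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        mixed.append(mix_features(a, b, lam))
    return mixed


if __name__ == "__main__":
    # Toy feature vectors standing in for real malware feature representations.
    family = [[0.0, 1.0, 0.2], [1.0, 0.0, 0.4], [0.5, 0.5, 0.9]]
    for vector in augment(family, n_new=3):
        print([round(v, 3) for v in vector])
```

Each synthetic vector lies on the line segment between two real ones, so it stays inside the region of feature space the originals span. Whether such a point corresponds to a functioning binary is exactly the gap the third prong is meant to address.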

Award Number
Lead PI
Kevin Leach
Taylor Johnson