Malware Classification Using Static Disassembly and Machine Learning
- URL: http://arxiv.org/abs/2201.07649v1
- Date: Fri, 10 Dec 2021 18:14:47 GMT
- Title: Malware Classification Using Static Disassembly and Machine Learning
- Authors: Zhenshuo Chen, Eoin Brophy, Tomas Ward
- Abstract summary: We propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content, and import libraries, to classify malware families.
Compared with detailed behavior-related features like API sequences, proposed features provide macroscopic information about malware.
We show that the novel proposed features together with a classical machine learning algorithm (Random Forest) presents very good accuracy at 99.40%.
- Score: 1.5469452301122177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Network and system security are incredibly critical issues now. Due to the
rapid proliferation of malware, traditional analysis methods struggle with
enormous samples.
In this paper, we propose four easy-to-extract and small-scale features,
including sizes and permissions of Windows PE sections, content complexity, and
import libraries, to classify malware families, and use automatic machine
learning to search for the best model and hyper-parameters for each feature and
their combinations. Compared with detailed behavior-related features like API
sequences, proposed features provide macroscopic information about malware. The
analysis is based on static disassembly scripts and hexadecimal machine code.
Unlike dynamic behavior analysis, static analysis is resource-efficient and
offers complete code coverage, but is vulnerable to code obfuscation and
encryption.
The results demonstrate that features which work well in dynamic analysis are
not necessarily effective when applied to static analysis. For instance, API
4-grams only achieve 57.96% accuracy and involve a relatively high dimensional
feature set (5000 dimensions). In contrast, the novel proposed features
together with a classical machine learning algorithm (Random Forest) presents
very good accuracy at 99.40% and the feature vector is of much smaller
dimension (40 dimensions). We demonstrate the effectiveness of this approach
through integration in IDA Pro, which also facilitates the collection of new
training samples and subsequent model retraining.
Related papers
- Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions [2.243674903279612]
State-of-the-art machine learning techniques can predict functions with possible security vulnerabilities in JavaScript programs.
Best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76.
Deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70.
arXiv Detail & Related papers (2024-05-12T08:23:42Z) - E&V: Prompting Large Language Models to Perform Static Analysis by
Pseudo-code Execution and Verification [7.745665775992235]
Large Language Models (LLMs) offer new capabilities for software engineering tasks.
LLMs simulate the execution of pseudo-code, effectively conducting static analysis encoded in the pseudo-code with minimal human effort.
E&V includes a verification process for pseudo-code execution without needing an external oracle.
arXiv Detail & Related papers (2023-12-13T19:31:00Z) - Decoding the Secrets of Machine Learning in Malware Classification: A
Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each)
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z) - Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods [2.9248916859490173]
We show that high detection accuracies can be achieved using features extracted through static analysis alone.
Random forests are generally the most effective model, outperforming more complex deep learning approaches.
arXiv Detail & Related papers (2023-01-30T10:48:10Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and Behavioral Malware Representations [5.439020425819001]
We propose a hybrid machine learning architecture that simultaneously employs multiple deep learning models.
We report an improved detection rate, above the capabilities of the current state-of-the-art model.
arXiv Detail & Related papers (2022-08-20T05:30:16Z) - ALBench: A Framework for Evaluating Active Learning in Object Detection [102.81795062493536]
This paper contributes an active learning benchmark framework named as ALBench for evaluating active learning in object detection.
Developed on an automatic deep model training system, this ALBench framework is easy-to-use, compatible with different active learning algorithms, and ensures the same training and testing protocols.
arXiv Detail & Related papers (2022-07-27T07:46:23Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z) - Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose the aggregate interaction modules to integrate the features from adjacent levels.
To obtain more efficient multi-scale features, the self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z) - Simple and Effective VAE Training with Calibrated Decoders [123.08908889310258]
Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions.
We study the impact of calibrated decoders, which learn the uncertainty of the decoding distribution.
We propose a simple but novel modification to the commonly used Gaussian decoder, which computes the prediction variance analytically.
arXiv Detail & Related papers (2020-06-23T17:57:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.