Related papers: Malware Classification Using Static Disassembly and Machine Learning

Malware Classification Using Static Disassembly and Machine Learning

URL: http://arxiv.org/abs/2201.07649v1
Date: Fri, 10 Dec 2021 18:14:47 GMT
Title: Malware Classification Using Static Disassembly and Machine Learning
Authors: Zhenshuo Chen, Eoin Brophy, Tomas Ward
Abstract summary: We propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content, and import libraries, to classify malware families. Compared with detailed behavior-related features like API sequences, proposed features provide macroscopic information about malware. We show that the novel proposed features together with a classical machine learning algorithm (Random Forest) presents very good accuracy at 99.40%.
Score: 1.5469452301122177
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Network and system security are incredibly critical issues now. Due to the rapid proliferation of malware, traditional analysis methods struggle with enormous samples. In this paper, we propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content complexity, and import libraries, to classify malware families, and use automatic machine learning to search for the best model and hyper-parameters for each feature and their combinations. Compared with detailed behavior-related features like API sequences, proposed features provide macroscopic information about malware. The analysis is based on static disassembly scripts and hexadecimal machine code. Unlike dynamic behavior analysis, static analysis is resource-efficient and offers complete code coverage, but is vulnerable to code obfuscation and encryption. The results demonstrate that features which work well in dynamic analysis are not necessarily effective when applied to static analysis. For instance, API 4-grams only achieve 57.96% accuracy and involve a relatively high dimensional feature set (5000 dimensions). In contrast, the novel proposed features together with a classical machine learning algorithm (Random Forest) presents very good accuracy at 99.40% and the feature vector is of much smaller dimension (40 dimensions). We demonstrate the effectiveness of this approach through integration in IDA Pro, which also facilitates the collection of new training samples and subsequent model retraining.

Related papers

Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction [8.056137513320065]
This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations.
arXiv Detail & Related papers (2025-04-18T13:13:39Z)
Imbalanced malware classification: an approach based on dynamic classifier selection [0.0]
A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications.
arXiv Detail & Related papers (2025-03-30T19:12:16Z)
A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy. We present around 55 distinguished features that are extracted from industrial images, which are then analyzed using statistical methods. By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z)
SLIFER: Investigating Performance and Robustness of Malware Detection Pipelines [12.940071285118451]
academia focuses on combining static and dynamic analysis within a single or ensemble of models. In this paper, we investigate the properties of malware detectors built with multiple and different types of analysis. As far as we know, we are the first to investigate the properties of sequential malware detectors, shedding light on their behavior in real production environment.
arXiv Detail & Related papers (2024-05-23T12:06:10Z)
Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions [2.243674903279612]
State-of-the-art machine learning techniques can predict functions with possible security vulnerabilities in JavaScript programs. Best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76. Deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70.
arXiv Detail & Related papers (2024-05-12T08:23:42Z)
E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification [7.745665775992235]
Large Language Models (LLMs) offer new capabilities for software engineering tasks. LLMs simulate the execution of pseudo-code, effectively conducting static analysis encoded in the pseudo-code with minimal human effort. E&V includes a verification process for pseudo-code execution without needing an external oracle.
arXiv Detail & Related papers (2023-12-13T19:31:00Z)
Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each) We train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z)
Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods [2.9248916859490173]
We show that high detection accuracies can be achieved using features extracted through static analysis alone. Random forests are generally the most effective model, outperforming more complex deep learning approaches.
arXiv Detail & Related papers (2023-01-30T10:48:10Z)
Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification. Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and Behavioral Malware Representations [5.439020425819001]
We propose a hybrid machine learning architecture that simultaneously employs multiple deep learning models. We report an improved detection rate, above the capabilities of the current state-of-the-art model.
arXiv Detail & Related papers (2022-08-20T05:30:16Z)
ALBench: A Framework for Evaluating Active Learning in Object Detection [102.81795062493536]
This paper contributes an active learning benchmark framework named as ALBench for evaluating active learning in object detection. Developed on an automatic deep model training system, this ALBench framework is easy-to-use, compatible with different active learning algorithms, and ensures the same training and testing protocols.
arXiv Detail & Related papers (2022-07-27T07:46:23Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification. Our strategy enables important aspects of the base learner objective to be learned during meta-training. We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose the aggregate interaction modules to integrate the features from adjacent levels. To obtain more efficient multi-scale features, the self-interaction modules are embedded in each decoder unit. Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z)
Simple and Effective VAE Training with Calibrated Decoders [123.08908889310258]
Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions. We study the impact of calibrated decoders, which learn the uncertainty of the decoding distribution. We propose a simple but novel modification to the commonly used Gaussian decoder, which computes the prediction variance analytically.
arXiv Detail & Related papers (2020-06-23T17:57:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.