Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
- URL: http://arxiv.org/abs/2506.16224v1
- Date: Thu, 19 Jun 2025 11:32:53 GMT
- Title: Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
- Authors: Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, Durga Prasad Mohapatra,
- Abstract summary: This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification.<n>By implementing our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms.
- Score: 0.7999703756441756
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, contiguous string or API call sequences. This approach effectively captures distinctive linguistic patterns among malware and benign families, enabling finer-grained classification. We delve into n-gram size selection, feature representation, and classification algorithms. While evaluating our proposed method on real-world malware samples, we observe significantly improved accuracy compared to the traditional methods. By implementing our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms by using hybrid feature selection technique to address high dimensionality. Hybrid feature selection technique reduces the feature set to only 1.6% of the original features.
Related papers
- OpCode-Based Malware Classification Using Machine Learning and Deep Learning Techniques [0.0]
This report presents a comprehensive analysis of malware classification using OpCode sequences.<n>Two distinct approaches are evaluated: traditional machine learning using n-gram analysis with Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree classifiers; and a deep learning approach employing a Convolutional Neural Network (CNN)
arXiv Detail & Related papers (2025-04-18T02:09:57Z) - k-fold Subsampling based Sequential Backward Feature Elimination [11.640238391159118]
This algorithm is a hybrid feature selection approach combining the benefits of filter and wrapper methods.<n>It can improve the detection speed of the SVM classifier by over 50% with up to 2% better detection accuracy.<n>Our algorithm also outperforms the equivalent systems introduced in the deformable part model approach with around 9% improvement in the detection accuracy.
arXiv Detail & Related papers (2025-03-14T23:10:08Z) - Provably Efficient Representation Learning with Tractable Planning in
Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov Decision Processes (POMDPs)
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU)
We then show how to adapt this algorithm to also work in the broader class of $gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective [67.45111837188685]
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data.
We experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning.
arXiv Detail & Related papers (2022-06-16T11:44:11Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - A concise method for feature selection via normalized frequencies [0.0]
In this paper, a concise method is proposed for universal feature selection.
The proposed method uses a fusion of the filter method and the wrapper method, rather than a combination of them.
The evaluation results show that the proposed method outperformed several state-of-the-art related works in terms of accuracy, precision, recall, F-score and AUC.
arXiv Detail & Related papers (2021-06-10T15:29:54Z) - Hybrid Method Based on NARX models and Machine Learning for Pattern
Recognition [0.0]
This work presents a novel technique that integrates the methodologies of machine learning and system identification to solve multiclass problems.
The efficiency of the method was tested by running case studies investigated in machine learning, obtaining better absolute results when compared with classical classification algorithms.
arXiv Detail & Related papers (2021-06-08T00:17:36Z) - Optimizing Speech Emotion Recognition using Manta-Ray Based Feature
Selection [1.4502611532302039]
We show that concatenation of features, extracted by using different existing feature extraction methods can boost the classification accuracy.
We also perform a novel application of Manta Ray optimization in speech emotion recognition tasks that resulted in a state-of-the-art result.
arXiv Detail & Related papers (2020-09-18T16:09:34Z) - Information Theoretic Meta Learning with Gaussian Processes [74.54485310507336]
We formulate meta learning using information theoretic concepts; namely, mutual information and the information bottleneck.
By making use of variational approximations to the mutual information, we derive a general and tractable framework for meta learning.
arXiv Detail & Related papers (2020-09-07T16:47:30Z) - Strong Generalization and Efficiency in Neural Programs [69.18742158883869]
We study the problem of learning efficient algorithms that strongly generalize in the framework of neural program induction.
By carefully designing the input / output interfaces of the neural model and through imitation, we are able to learn models that produce correct results for arbitrary input sizes.
arXiv Detail & Related papers (2020-07-07T17:03:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.