Related papers: Efficient Malware Detection with Optimized Learning on High-Dimensional Features

URL: http://arxiv.org/abs/2506.17309v1
Date: Wed, 18 Jun 2025 06:56:59 GMT
Title: Efficient Malware Detection with Optimized Learning on High-Dimensional Features
Authors: Aditya Choudhary, Sarthak Pawar, Yashodhara Haribhakta
Abstract summary: Malware detection using machine learning requires feature extraction from binary files. A common approach involves using LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. This study addresses these challenges by applying two dimensionality reduction techniques.
Score: 1.3654846342364308
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Malware detection using machine learning requires feature extraction from binary files, as models cannot process raw binaries directly. A common approach involves using LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. However, the high dimensionality of these features introduces significant computational challenges. This study addresses these challenges by applying two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). We evaluate three reduced feature dimensions (128, 256, and 384), which correspond to approximately 5.4%, 10.8%, and 16.1% of the original 2381 features, across four models (XGBoost, LightGBM, Extra Trees, and Random Forest) using a unified training, validation, and testing split formed from the EMBER-2018, ERMDS, and BODMAS datasets. This approach ensures generalization and avoids dataset bias. Experimental results show that LightGBM trained on the 384-dimensional feature set after XGBoost feature selection achieves the highest accuracy of 97.52% on the unified dataset, providing an optimal balance between computational efficiency and detection performance. The best model, trained in 61 minutes using 30 GB of RAM and 19.5 GB of disk space, generalizes effectively to completely unseen datasets, maintaining 95.31% accuracy on TRITIUM and 93.98% accuracy on INFERNO. These findings present a scalable, compute-efficient approach for malware detection without compromising accuracy.
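The reduction percentages quoted in the abstract, and the general idea of importance-based feature selection, can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names (`retained_fraction`, `top_k_indices`, `select_columns`) and the toy importance scores are hypothetical, and in practice the scores would presumably come from a trained gradient-boosted model rather than a hand-written list.

```python
# Illustrative sketch only. The importance scores used below are made up;
# they are NOT values from the paper.

ORIGINAL_DIM = 2381              # EMBER feature-vector length, from the abstract
REDUCED_DIMS = (128, 256, 384)   # reduced sizes evaluated in the abstract


def retained_fraction(k, total=ORIGINAL_DIM):
    """Return k / total expressed as a percentage."""
    return 100.0 * k / total


def top_k_indices(importances, k):
    """Return the indices of the k largest importance scores, in ascending index order."""
    if not 0 < k <= len(importances):
        raise ValueError("k must be between 1 and the number of features")
    # Rank feature indices from most to least important, then keep the first k.
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


def select_columns(row, indices):
    """Project one feature vector onto the selected feature indices."""
    return [row[i] for i in indices]


if __name__ == "__main__":
    for k in REDUCED_DIMS:
        print(f"{k} features = {retained_fraction(k):.1f}% of {ORIGINAL_DIM}")

    toy_importances = [0.05, 0.40, 0.00, 0.30, 0.25]  # hypothetical scores
    keep = top_k_indices(toy_importances, 3)
    print(keep)                                        # indices of the kept features
    print(select_columns([10, 11, 12, 13, 14], keep))  # reduced feature vector
```

Unlike PCA, which builds new axes as linear combinations of all inputs, this kind of selection keeps a subset of the original columns, so each retained dimension still corresponds to a named feature.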

Related papers

Adaptive Malware Detection using Sequential Feature Selection: A Dueling Double Deep Q-Network (D3QN) Framework for Intelligent Classification [1.4120905648647635]
We formulate malware classification as a Markov Decision Process with episodic feature acquisition. We propose a Dueling Double Deep Q-Network (D3QN) framework for adaptive sequential feature selection. We evaluate our approach on Microsoft Big2015 (9-class, 1,795 features) and BODMAS (binary, 2,381 features) datasets.
arXiv Detail & Related papers (2025-07-06T12:37:50Z)
LPASS: Linear Probes as Stepping Stones for vulnerability detection using compressed LLMs [0.0]
We show how Linear Probes can be used to provide an estimation on the performance of a compressed large language model. We also show their suitability to set the cut-off point when applying layer pruning compression. Our approach, dubbed LPASS, is applied in BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples.
arXiv Detail & Related papers (2025-05-30T10:37:14Z)
Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection [52.716143424856185]
We propose LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Our method also outperforms the greedy search in attribution efficiency, being 1.6 times faster.
arXiv Detail & Related papers (2025-04-01T06:58:15Z)
Malware Classification from Memory Dumps Using Machine Learning, Transformers, and Large Language Models [1.038088229789127]
This study investigates the performance of various classification models for a malware classification task using different feature sets and data configurations. XGB achieved the highest accuracy of 87.42% using the Top 45 Features, outperforming all other models. Deep learning models underperformed, with RNN achieving 66.71% accuracy and Transformers reaching 71.59%.
arXiv Detail & Related papers (2025-03-04T00:24:21Z)
Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym.
arXiv Detail & Related papers (2025-02-06T18:59:47Z)
The object detection model uses combined extraction with KNN and RF classification [0.0]
This study contributes to the field of object detection with a new approach combining GLCM and LBP as feature vectors as well as VE for classification. System testing used a dataset of 4,437 2D images; KNN achieved 92.7% accuracy and a 92.5% F1-score, while RF performance was lower.
arXiv Detail & Related papers (2024-05-09T05:21:42Z)
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets. We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z)
KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection [48.66703222700795]
We resort to a novel kernel strategy to identify the most informative point clouds to acquire labels. To accommodate both one-stage (i.e., SECOND) and two-stage detectors, we incorporate the classification entropy tangent and balance the trade-off between detection performance and the total number of bounding boxes selected for annotation. Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art method.
arXiv Detail & Related papers (2023-07-16T04:27:03Z)
FDINet: Protecting against DNN Model Extraction via Feature Distortion Index [25.69643512837956]
FDINET is a novel defense mechanism that leverages the feature distribution of deep neural network (DNN) models. It exploits FDI similarity to identify colluding adversaries from distributed extraction attacks. FDINET exhibits the capability to identify colluding adversaries with an accuracy exceeding 91%.
arXiv Detail & Related papers (2023-06-20T07:14:37Z)
EAutoDet: Efficient Architecture Search for Object Detection [110.99532343155073]
EAutoDet framework can discover practical backbone and FPN architectures for object detection in 1.4 GPU-days. We propose a kernel reusing technique by sharing the weights of candidate operations on one edge and consolidating them into one convolution. In particular, the discovered architectures surpass state-of-the-art object detection NAS methods and achieve 40.1 mAP with 120 FPS and 49.2 mAP with 41.3 FPS on COCO test-dev set.
arXiv Detail & Related papers (2022-03-21T05:56:12Z)
Sample and Computation Redistribution for Efficient Face Detection [137.19388513633484]
Training data sampling and computation distribution strategies are the keys to efficient and accurate face detection. SCRFD-34GF outperforms the best competitor, TinaFace, by 3.86% (AP at hard set) while being more than 3× faster on GPUs with VGA-resolution images.
arXiv Detail & Related papers (2021-05-10T23:51:14Z)
DrNAS: Dirichlet Neural Architecture Search [88.56953713817545]
We treat the continuously relaxed architecture mixing weight as random variables, modeled by Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with gradient-based generalization. To alleviate the large memory consumption of differentiable NAS, we propose a simple yet effective progressive learning scheme.
arXiv Detail & Related papers (2020-06-18T08:23:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.