CAFE-GB: Scalable and Stable Feature Selection for Malware Detection via Chunk-wise Aggregated Gradient Boosting
- URL: http://arxiv.org/abs/2601.15754v1
- Date: Thu, 22 Jan 2026 08:43:15 GMT
- Title: CAFE-GB: Scalable and Stable Feature Selection for Malware Detection via Chunk-wise Aggregated Gradient Boosting
- Authors: Ajvad Haneef K, Karan Kuwar Singh, Madhu Kumar S D,
- Abstract summary: This paper proposes a scalable feature selection framework designed to produce stable and globally consistent feature rankings for high-dimensional malware detection.<n>CEFA-GB partitions training data into overlapping chunks, estimates local feature importance using gradient boosting models, and aggregates these estimates.<n>It achieves performance parity with full-feature baselines across multiple metrics, including Accuracy, F1-score, ROC-AUC and PR-AUC, while reducing feature dimensionality by more than 95%.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional malware datasets often exhibit feature redundancy, instability, and scalability limitations, which hinder the effectiveness and interpretability of machine learning-based malware detection systems. Although feature selection is commonly employed to mitigate these issues, many existing approaches lack robustness when applied to large-scale and heterogeneous malware data. To address this gap, this paper proposes CAFE-GB (Chunk-wise Aggregated Feature Estimation using Gradient Boosting), a scalable feature selection framework designed to produce stable and globally consistent feature rankings for high-dimensional malware detection. CAFE-GB partitions training data into overlapping chunks, estimates local feature importance using gradient boosting models, and aggregates these estimates to derive a robust global ranking. Feature budget selection is performed separately through a systematic k-selection and stability analysis to balance detection performance and robustness. The proposed framework is evaluated on two large-scale malware datasets: BODMAS and CIC-AndMal2020, representing large and diverse malware feature spaces. Experimental results show that classifiers trained on CAFE-GB -selected features achieve performance parity with full-feature baselines across multiple metrics, including Accuracy, F1-score, MCC, ROC-AUC, and PR-AUC, while reducing feature dimensionality by more than 95\%. Paired Wilcoxon signed-rank tests confirm that this reduction does not introduce statistically significant performance degradation. Additional analyses demonstrate low inter-feature redundancy and improved interpretability through SHAP-based explanations. Runtime and memory profiling further indicate reduced downstream classification overhead. Overall, CAFE-GB provides a stable, interpretable, and scalable feature selection strategy for large-scale malware detection.
Related papers
- Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection [64.0168648353038]
3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability.<n>Current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity.<n>We propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection.
arXiv Detail & Related papers (2025-10-19T14:56:38Z) - Source-Free Object Detection with Detection Transformer [59.33653163035064]
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data.<n>Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR)<n>In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs.
arXiv Detail & Related papers (2025-10-13T07:35:04Z) - Towards a more realistic evaluation of machine learning models for bearing fault diagnosis [0.28873930745906956]
This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation.<n>We propose a leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing.<n>We evaluate our methodology on three widely adopted datasets: CWRU, Paderborn University (PU), and University of Ottawa (UORED-VAF)
arXiv Detail & Related papers (2025-09-26T12:35:02Z) - Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset [0.0]
This study investigates the effectiveness of several machine learning algorithms for static malware detection using the EMBER dataset.<n>We evaluate eight classification models: LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, HistGradientBoosting, k-Nearest Neighbors (KNN), and TabNet.<n>The models are assessed on accuracy, precision, recall, F1 score, and AUC to examine both predictive performance and robustness.
arXiv Detail & Related papers (2025-07-22T18:45:10Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support.<n>We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges.<n>Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models [13.157596316463621]
Test-time adaptation (TTA) of visual language models has attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks.<n>We introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG)<n> Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.
arXiv Detail & Related papers (2025-03-24T04:32:35Z) - A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy.<n>We present around 55 distinguished features that are extracted from industrial images, which are then analyzed using statistical methods.<n>By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z) - Ransomware detection using stacked autoencoder for feature selection [0.0]
The study meticulously analyzes the autoencoder's learned weights and activations to identify essential features for distinguishing ransomware families from other malware.
The proposed model achieves an exceptional 99% accuracy in ransomware classification, surpassing the Extreme Gradient Boosting (XGBoost) algorithm.
arXiv Detail & Related papers (2024-02-17T17:31:48Z) - On the Efficacy of Generalization Error Prediction Scoring Functions [33.24980750651318]
Generalization error predictors (GEPs) aim to predict model performance on unseen distributions by deriving dataset-level error estimates from sample-level scores.
We rigorously study the effectiveness of popular scoring functions (confidence, local manifold smoothness, model agreement) independent of mechanism choice.
arXiv Detail & Related papers (2023-03-23T18:08:44Z) - Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians.<n>We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.<n>By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - Adversarial Feature Augmentation and Normalization for Visual
Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - Estimating Structural Target Functions using Machine Learning and
Influence Functions [103.47897241856603]
We propose a new framework for statistical machine learning of target functions arising as identifiable functionals from statistical models.
This framework is problem- and model-agnostic and can be used to estimate a broad variety of target parameters of interest in applied statistics.
We put particular focus on so-called coarsening at random/doubly robust problems with partially unobserved information.
arXiv Detail & Related papers (2020-08-14T16:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.