Related papers: Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Hard-to-Learn Data

URL: http://arxiv.org/abs/2506.20444v2
Date: Fri, 15 Aug 2025 19:44:53 GMT
Title: Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Hard-to-Learn Data
Authors: Xiang Lan, Tim Menzies, Bowen Xu
Abstract summary: Vulnerability detection is crucial for identifying security weaknesses in software systems. This paper proposes a novel method to significantly enhance the active learning process by using dataset maps. Our approach systematically identifies samples that are hard-to-learn for a model and integrates this information to create a more sophisticated sample selection strategy.
Score: 15.490968013867562
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vulnerability detection is crucial for identifying security weaknesses in software systems. However, training effective machine learning models for this task is often constrained by the high cost and expertise required for data annotation. Active learning is a promising approach to mitigate this challenge by intelligently selecting the most informative data points for labeling. This paper proposes a novel method to significantly enhance the active learning process by using dataset maps. Our approach systematically identifies samples that are hard-to-learn for a model and integrates this information to create a more sophisticated sample selection strategy. Unlike traditional active learning methods that focus primarily on model uncertainty, our strategy enriches the selection process by considering learning difficulty, allowing the active learner to more effectively pinpoint truly informative examples. The experimental results show that our approach can improve F1 score over random selection by 61.54% (DeepGini) and 45.91% (K-Means) and outperforms standard active learning by 8.23% (DeepGini) and 32.65% (K-Means) for CodeBERT on the Big-Vul dataset, demonstrating the effectiveness of integrating dataset maps for optimizing sample selection in vulnerability detection. Furthermore, our approach also enhances model robustness, improves sample selection by filtering hard-to-learn data, and stabilizes active learning performance across iterations. By analyzing the characteristics of these outliers, we provide insights for future improvements in dataset construction, making vulnerability detection more reliable and cost-effective.
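The abstract names two ingredients: an uncertainty-based acquisition score (DeepGini) and dataset maps that flag hard-to-learn samples. The following is a minimal sketch of how the two could be combined, not the authors' implementation. It assumes the usual dataset-map statistics (confidence as the mean probability assigned to the true label across epochs, variability as its standard deviation) and the standard DeepGini impurity. The thresholds, the function names, and the way the exclusion set reaches the selector are all hypothetical; the abstract does not say how hard-to-learn information is carried over to unlabeled candidates.

```python
from statistics import mean, pstdev


def deepgini(probs):
    """DeepGini impurity: 1 - sum(p_i^2); higher means a less certain prediction."""
    return 1.0 - sum(p * p for p in probs)


def dataset_map_stats(gold_probs_per_epoch):
    """Return (confidence, variability) for one labeled sample.

    gold_probs_per_epoch holds the probability the model assigned to the
    sample's true label at the end of each training epoch.
    """
    return mean(gold_probs_per_epoch), pstdev(gold_probs_per_epoch)


def is_hard_to_learn(gold_probs_per_epoch, max_confidence=0.3, max_variability=0.15):
    """Flag samples with low confidence and low variability.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    confidence, variability = dataset_map_stats(gold_probs_per_epoch)
    return confidence <= max_confidence and variability <= max_variability


def select_for_labeling(pool_probs, excluded_ids, budget):
    """Pick the `budget` most uncertain pool samples, skipping excluded ids.

    pool_probs maps a sample id to its predicted class probabilities.
    excluded_ids is a hypothetical set of ids judged hard-to-learn.
    """
    scored = [(deepgini(p), i) for i, p in pool_probs.items() if i not in excluded_ids]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most uncertain first
    return [i for _, i in scored[:budget]]


if __name__ == "__main__":
    # Toy training histories: "b" stays wrong in every epoch.
    history = {"a": [0.9, 0.95, 0.97], "b": [0.1, 0.12, 0.08], "c": [0.2, 0.8, 0.5]}
    hard = {i for i, h in history.items() if is_hard_to_learn(h)}
    pool = {"a": [0.55, 0.45], "b": [0.5, 0.5], "c": [0.9, 0.1]}
    print(select_for_labeling(pool, hard, budget=1))
```

In this toy run, sample "b" has the highest DeepGini score but is skipped because it was flagged as hard-to-learn, so the selector falls back to the next most uncertain sample. That is the intuition behind pruning: uncertainty alone can keep steering the labeling budget toward samples the model may never learn.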

Related papers

Optimizing Active Learning in Vision-Language Models via Parameter-Efficient Uncertainty Calibration [6.7181844004432385]
We introduce a novel parameter-efficient learning methodology that incorporates uncertainty calibration loss within the Active Learning framework. We demonstrate that our solution can match and exceed the performance of complex feature-based sampling techniques.
arXiv Detail & Related papers (2025-07-29T06:08:28Z)
Z-Error Loss for Training Neural Networks [0.0]
Outliers introduce significant training challenges in neural networks by propagating erroneous gradients, which can degrade model performance and generalization. We propose the Z-Error Loss, a statistically principled approach that minimizes outlier influence during training by masking the contribution of data points identified as out-of-distribution within each batch.
arXiv Detail & Related papers (2025-06-02T18:35:30Z)
Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining [36.772769830368475]
This study analyzed the role of self-supervised learning methods in complex data mining through systematic experiments. Results show that the model has strong adaptability on different data sets, can effectively extract high-quality features from unlabeled data, and improves classification accuracy.
arXiv Detail & Related papers (2025-04-05T02:55:44Z)
Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection [2.7554677967598047]
Adversarially robust learning is widely recognized to demand significantly more training examples. Recent works propose the use of self-supervised adversarial training with external or synthetically generated unlabeled data to enhance model robustness. We propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement.
arXiv Detail & Related papers (2025-01-15T15:47:49Z)
Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
DRoP: Distributionally Robust Data Pruning [11.930434318557156]
We conduct the first systematic study of the impact of data pruning on classification bias of trained models. We propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks.
arXiv Detail & Related papers (2024-04-08T14:55:35Z)
Compute-Efficient Active Learning [0.0]
Active learning aims at reducing labeling costs by selecting the most informative samples from an unlabeled dataset. The traditional active learning process often demands extensive computational resources, hindering scalability and efficiency. We present a novel method designed to alleviate the computational burden associated with active learning on massive datasets.
arXiv Detail & Related papers (2024-01-15T12:32:07Z)
Learning Objective-Specific Active Learning Strategies with Attentive Neural Processes [72.75421975804132]
Learning Active Learning (LAL) suggests learning the active learning strategy itself, allowing it to adapt to the given setting. We propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem. Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives.
arXiv Detail & Related papers (2023-09-11T14:16:37Z)
MAPS: A Noise-Robust Progressive Learning Approach for Source-Free Domain Adaptive Keypoint Detection [76.97324120775475]
Cross-domain keypoint detection methods always require accessing the source data during adaptation. This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z)
SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation. We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss. Our approach achieves superior performance to the state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
Frugal Reinforcement-based Active Learning [12.18340575383456]
We propose a novel active learning approach for label-efficient training. The proposed method is iterative and aims at minimizing a constrained objective function that mixes diversity, representativity and uncertainty criteria. We also introduce a novel weighting mechanism based on reinforcement learning, which adaptively balances these criteria at each training iteration.
arXiv Detail & Related papers (2022-12-09T14:17:45Z)
Responsible Active Learning via Human-in-the-loop Peer Study [88.01358655203441]
We propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability. We first introduce a human-in-the-loop teacher-student architecture to isolate unlabelled data from the task learner (teacher) on the cloud-side. During training, the task learner instructs the light-weight active learner which then provides feedback on the active sampling criterion.
arXiv Detail & Related papers (2022-11-24T13:18:27Z)
ALLSH: Active Learning Guided by Local Sensitivity and Hardness [98.61023158378407]
We propose to retrieve unlabeled samples with a local sensitivity and hardness-aware acquisition function. Our method achieves consistent gains over the commonly used active learning strategies in various classification tasks.
arXiv Detail & Related papers (2022-05-10T15:39:11Z)
Towards Reducing Labeling Cost in Deep Object Detection [61.010693873330446]
We propose a unified framework for active learning, that considers both the uncertainty and the robustness of the detector. Our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift.
arXiv Detail & Related papers (2021-06-22T16:53:09Z)
Low-Regret Active learning [64.36270166907788]
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training. At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on predictable (easy) instances.
arXiv Detail & Related papers (2021-04-06T22:53:45Z)
Auto-weighted Robust Federated Learning with Corrupted Data Sources [7.475348174281237]
Federated learning provides a communication-efficient and privacy-preserving training process. Standard federated learning techniques that naively minimize an average loss function are vulnerable to data corruptions. We propose Auto-weighted Robust Federated Learning (arfl) to provide robustness against corrupted data sources.
arXiv Detail & Related papers (2021-01-14T21:54:55Z)
Bayesian Active Learning for Wearable Stress and Affect Detection [0.7106986689736827]
Stress detection using on-device deep learning algorithms has been on the rise owing to advancements in pervasive computing. In this paper, we propose a framework with capabilities to represent model uncertainties through approximations in Bayesian Neural Networks. Our proposed framework achieves a considerable efficiency boost during inference, with a substantially low number of acquired pool points.
arXiv Detail & Related papers (2020-12-04T16:19:37Z)
Learning to Rank for Active Learning: A Listwise Approach [36.72443179449176]
Active learning emerged as an alternative to alleviate the effort of labeling huge amounts of data for data-hungry applications. In this work, we rethink the structure of the loss prediction module, using a simple but effective listwise approach. Experimental results on four datasets demonstrate that our method outperforms recent state-of-the-art active learning approaches for both image classification and regression tasks.
arXiv Detail & Related papers (2020-07-31T21:05:16Z)
Adversarial Self-Supervised Contrastive Learning [62.17538130778111]
Existing adversarial learning approaches mostly use class labels to generate adversarial samples that lead to incorrect predictions. We propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples. We present a self-supervised contrastive learning framework to adversarially train a robust neural network without labeled data.
arXiv Detail & Related papers (2020-06-13T08:24:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.