When Less is Enough: Positive and Unlabeled Learning Model for
Vulnerability Detection
- URL: http://arxiv.org/abs/2308.10523v1
- Date: Mon, 21 Aug 2023 07:21:38 GMT
- Title: When Less is Enough: Positive and Unlabeled Learning Model for
Vulnerability Detection
- Authors: Xin-Cheng Wen, Xinchen Wang, Cuiyun Gao, Shaohua Wang, Yang Liu,
Zhaoquan Gu
- Abstract summary: This paper focuses on the Positive and Unlabeled (PU) learning problem for vulnerability detection.
We propose a novel model named PILOT, i.e., PositIve and unlabeled Learning mOdel for vulnerability deTection.
- Score: 18.45462960578864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated code vulnerability detection has gained increasing attention in
recent years. The deep learning (DL)-based methods, which implicitly learn
vulnerable code patterns, have proven effective in vulnerability detection. The
performance of DL-based methods usually relies on the quantity and quality of
labeled data. However, the current labeled data are generally automatically
collected, such as crawled from human-generated commits, making it hard to
ensure the quality of the labels. Prior studies have demonstrated that the
non-vulnerable code (i.e., negative labels) tends to be unreliable in
commonly-used datasets, while vulnerable code (i.e., positive labels) is more
determined. Considering the large numbers of unlabeled data in practice, it is
necessary and worth exploring to leverage the positive data and large numbers
of unlabeled data for more accurate vulnerability detection.
In this paper, we focus on the Positive and Unlabeled (PU) learning problem
for vulnerability detection and propose a novel model named PILOT, i.e.,
PositIve and unlabeled Learning mOdel for vulnerability deTection. PILOT only
learns from positive and unlabeled data for vulnerability detection. It mainly
contains two modules: (1) a distance-aware label selection module, which
generates pseudo-labels for selected unlabeled data via an inter-class
distance prototype and progressive fine-tuning; and (2) a mixed-supervision
representation learning module, which further alleviates the influence of
label noise and enhances the discriminativeness of the learned representations.
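To make module (1) above concrete, below is a minimal, hypothetical sketch of distance-aware pseudo-label selection. The abstract gives no implementation details, so the prototype construction, distance metric, and margin heuristic here are illustrative assumptions rather than PILOT's actual algorithm.

```python
# Hypothetical sketch of distance-aware pseudo-label selection in the spirit
# of PILOT's first module. The margin heuristic and mean-based prototypes are
# assumptions; the paper's inter-class distance prototype and progressive
# fine-tuning procedure may differ.
import numpy as np

def select_pseudo_labels(pos_emb: np.ndarray,
                         unl_emb: np.ndarray,
                         margin: float = 0.1):
    """Pseudo-label a confident subset of unlabeled code embeddings.

    pos_emb: (n_pos, d) embeddings of labeled vulnerable code.
    unl_emb: (n_unl, d) embeddings of unlabeled code.
    Returns indices pseudo-labeled vulnerable and non-vulnerable;
    ambiguous samples are left unlabeled.
    """
    pos_proto = pos_emb.mean(axis=0)   # positive-class prototype
    unl_proto = unl_emb.mean(axis=0)   # proxy prototype for the rest

    d_pos = np.linalg.norm(unl_emb - pos_proto, axis=1)
    d_unl = np.linalg.norm(unl_emb - unl_proto, axis=1)

    # Keep only samples that are clearly closer to one prototype.
    pseudo_pos = np.where(d_unl - d_pos > margin)[0]
    pseudo_neg = np.where(d_pos - d_unl > margin)[0]
    return pseudo_pos, pseudo_neg
```

In a progressive fine-tuning loop, this selection step could be repeated: the model is fine-tuned on the pseudo-labeled subset, the corpus is re-encoded, and the margin or the selected set is adjusted in each round.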
Related papers
- ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data [5.938113434208745]
Supervised learning-based software vulnerability detectors often fall short due to the inadequate availability of labelled training data.
In this paper, we explore a different approach by reframing vulnerability detection as one of anomaly detection.
Our approach achieves $1.62\times$ to $2.18\times$ better Top-5 accuracies and $1.02\times$ to $1.29\times$ better ROC scores on line-level vulnerability detection tasks.
arXiv Detail & Related papers (2024-08-28T03:28:17Z)
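For intuition on the ANVIL entry above, here is a generic sketch of anomaly-based line ranking; the scoring function is a stand-in, since ANVIL's actual anomaly score is defined in the paper rather than in this summary.

```python
# Generic sketch: rank source lines by how "surprising" a model finds them
# and flag the top-k as suspected vulnerable. The surprise function is a
# placeholder, not ANVIL's actual scoring mechanism.
from typing import Callable, List, Tuple

def rank_suspicious_lines(lines: List[str],
                          surprise: Callable[[str], float],
                          k: int = 5) -> List[Tuple[int, float]]:
    """Return the k most anomalous (line_index, score) pairs."""
    scores = [(i, surprise(line)) for i, line in enumerate(lines)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

Top-5 accuracy then measures whether a truly vulnerable line appears among the five highest-scoring candidates.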
- Improving Label Error Detection and Elimination with Uncertainty Quantification [5.184615738004059]
We develop novel, model-agnostic algorithms for Uncertainty Quantification-Based Label Error Detection (UQ-LED).
Our UQ-LED algorithms outperform state-of-the-art confident learning in identifying label errors.
We propose a novel approach to generate realistic, class-dependent label errors synthetically.
arXiv Detail & Related papers (2024-05-15T15:17:52Z)
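The following is an illustrative, generic recipe for uncertainty-based label error detection relevant to the UQ-LED entry above: train an ensemble and flag samples whose given labels contradict a confident ensemble consensus. It is a common UQ pattern, not necessarily the specific UQ-LED algorithm.

```python
# Generic uncertainty-based label error flagging: a sample is suspect when
# the ensemble confidently predicts a class other than its given label.
import numpy as np

def flag_label_errors(ensemble_probs: np.ndarray,
                      given_labels: np.ndarray,
                      conf_threshold: float = 0.9) -> np.ndarray:
    """ensemble_probs: (n_models, n_samples, n_classes) softmax outputs.
    given_labels: (n_samples,) dataset labels.
    Returns indices of samples whose labels look erroneous."""
    mean_probs = ensemble_probs.mean(axis=0)   # average over the ensemble
    pred = mean_probs.argmax(axis=1)           # consensus prediction
    conf = mean_probs.max(axis=1)              # consensus confidence
    suspect = (pred != given_labels) & (conf >= conf_threshold)
    return np.where(suspect)[0]
```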
- FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning [73.13448439554497]
Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data.
Most SSL methods are commonly based on instance-wise consistency between different data transformations.
We propose FlatMatch which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets.
arXiv Detail & Related papers (2023-10-25T06:57:59Z)
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
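For context on the ReFixMatch entry above, the FixMatch-style baseline it builds on pseudo-labels only high-confidence predictions; the sketch below shows that baseline step. Per the summary, ReFixMatch additionally exploits the low-confidence predictions this baseline masks out, via a mechanism detailed in the paper.

```python
# Sketch of threshold-based pseudo-labeling as used in FixMatch-style SSL.
# Only the common baseline is shown; ReFixMatch's handling of the discarded
# low-confidence predictions is not reproduced here.
import torch
import torch.nn.functional as F

def pseudo_label_step(model, weak_batch, strong_batch, threshold=0.95):
    """One unlabeled-batch step: label weak views, train on strong views."""
    with torch.no_grad():
        probs = torch.softmax(model(weak_batch), dim=1)
        conf, targets = probs.max(dim=1)
        mask = conf >= threshold          # keep only confident predictions
    logits = model(strong_batch)
    loss = F.cross_entropy(logits, targets, reduction="none")
    return (loss * mask.float()).mean()   # masked consistency loss
```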
- Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction [48.929877651182885]
Learning from positive and unlabeled data is known as positive-unlabeled (PU) learning in the literature.
We propose a new robust PU learning method with a training strategy motivated by the nature of human learning.
arXiv Detail & Related papers (2023-08-01T04:34:52Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
This paper proposes a label distribution perspective for PU learning.
Motivated by this perspective, we pursue consistency between the predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
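A minimal sketch of the distribution-consistency idea behind the Dist-PU entry above, assuming a known class prior: push the average predicted positive rate on unlabeled data toward that prior. The actual Dist-PU objective contains additional terms; this shows only the core penalty.

```python
# Core label-distribution-consistency penalty for PU learning: with a known
# class prior pi, the mean predicted positive probability on unlabeled data
# should match pi. Other Dist-PU loss terms are omitted.
import torch

def label_distribution_loss(unlabeled_logits: torch.Tensor,
                            class_prior: float) -> torch.Tensor:
    """Penalize mismatch between predicted and expected positive rates."""
    p_pos = torch.sigmoid(unlabeled_logits).mean()   # predicted positive rate
    return (p_pos - class_prior).abs()               # L1 gap to the prior
```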
- Robust Deep Semi-Supervised Learning: A Brief Introduction [63.09703308309176]
Semi-supervised learning (SSL) aims to improve learning performance by leveraging unlabeled data when labels are insufficient.
SSL with deep models has proven to be successful on standard benchmark tasks.
However, these deep SSL methods remain vulnerable to various robustness threats in real-world applications.
arXiv Detail & Related papers (2022-02-12T04:16:41Z)
- A Novel Perspective for Positive-Unlabeled Learning via Noisy Labels [49.990938653249415]
This research presents a methodology that assigns initial pseudo-labels to unlabeled data, treats them as noisy labels, and trains a deep neural network on the resulting noisy-labeled data.
Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-03-08T11:46:02Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)