Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?
- URL: http://arxiv.org/abs/2407.17803v1
- Date: Thu, 25 Jul 2024 06:22:25 GMT
- Title: Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?
- Authors: Triet H. M. Le, M. Ali Babar
- Abstract summary: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well.
We quantitatively and qualitatively study the quality and use of the state-of-the-art auto-labeled SV data, D2A, for SV prediction.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well. Current SV datasets mostly require expensive labeling efforts by experts (human-labeled) and thus are limited in size. Meanwhile, there are growing efforts in automatic SV labeling at scale. However, the fitness of auto-labeled data for SV prediction is still largely unknown. Aims: We quantitatively and qualitatively study the quality and use of the state-of-the-art auto-labeled SV data, D2A, for SV prediction. Method: Using multiple sources and manual validation, we curate clean SV data from human-labeled SV-fixing commits in two well-known projects for investigating the auto-labeled counterparts. Results: We discover that 50+% of the auto-labeled SVs are noisy (incorrectly labeled), and they hardly overlap with the publicly reported ones. Yet, SV prediction models utilizing the noisy auto-labeled SVs can perform up to 22% and 90% better in Matthews Correlation Coefficient and Recall, respectively, than the original models. We also reveal the promises and difficulties of applying noise-reduction methods for automatically addressing the noise in auto-labeled SV data to maximize the data utilization for SV prediction. Conclusions: Our study informs the benefits and challenges of using auto-labeled SVs, paving the way for large-scale SV prediction.
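The headline numbers above (up to 22% better Matthews Correlation Coefficient and 90% better Recall from training on noisy auto-labeled data) suggest a simple comparison template. Below is a minimal sketch of such an evaluation, not the paper's actual pipeline: TF-IDF features, logistic regression, and the toy code snippets are all illustrative assumptions. In the study's setting, one call would use the human-labeled training set and another the auto-labeled (D2A) one, with both scored on manually validated data.
```python
# Minimal sketch, not the paper's pipeline: train a function-level SV
# classifier and report the two metrics the abstract cites (MCC and Recall).
# Features, model, and toy data are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, recall_score

def evaluate(train_funcs, train_labels, test_funcs, test_labels):
    """Train on (possibly noisy) labels, score on a manually validated set."""
    vec = TfidfVectorizer(token_pattern=r"\S+")   # crude code tokenization
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(vec.fit_transform(train_funcs), train_labels)
    preds = clf.predict(vec.transform(test_funcs))
    return matthews_corrcoef(test_labels, preds), recall_score(test_labels, preds)

# Toy usage with made-up snippets (1 = vulnerable, 0 = not vulnerable).
train_x = ["strcpy(buf, input);", "return a + b;", "memcpy(dst, src, n);", "puts(msg);"]
train_y = [1, 0, 1, 0]
mcc, rec = evaluate(train_x, train_y, ["strcat(buf, s);", "return x * 2;"], [1, 0])
print(f"MCC={mcc:.2f} Recall={rec:.2f}")
```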
Related papers
- Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
We show that mitigating data imbalance can significantly improve the predictive performance of models for all the Common Vulnerability Scoring System (CVSS) tasks.
We also discover that simple text augmentation combining random text insertion, deletion, and replacement can outperform the baseline across the board; a minimal sketch of this augmentation follows this entry.
arXiv Detail & Related papers (2024-07-15T13:47:55Z)
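A minimal sketch of the augmentation named above (random insertion, deletion, and replacement); the tokenization, edit probabilities, and replacement vocabulary are assumptions, not the paper's exact setup.
```python
# Randomly insert, delete, or replace tokens to create augmented copies.
import random

def augment(tokens, vocab, p=0.1, seed=0):
    """Return a noisy copy of `tokens` with roughly a fraction p of edits."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p / 3:
            continue                       # deletion: drop this token
        if r < 2 * p / 3:
            out.append(rng.choice(vocab))  # replacement: random substitute
        else:
            out.append(tok)                # keep the token unchanged
        if rng.random() < p / 3:
            out.append(rng.choice(vocab))  # insertion: extra random token
    return out

words = "the report describes a buffer overflow in the parser".split()
print(" ".join(augment(words, vocab=words, p=0.3)))
```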
- Incremental Self-training for Semi-supervised Learning
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbones, effectively improving recognition accuracy and learning speed; a generic self-training loop is sketched after this entry.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
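A minimal sketch of the plain self-training loop that IST-style methods refine; the paper's incremental scheduling is not reproduced here, and the classifier, confidence threshold, and synthetic data are assumptions.
```python
# Generic self-training: repeatedly pseudo-label the unlabeled pool and move
# confident points into the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, rounds=3, threshold=0.9):
    clf = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        keep = proba.max(axis=1) >= threshold      # confident predictions only
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[keep].argmax(axis=1)]])
        X_unlab = X_unlab[~keep]
        clf = LogisticRegression().fit(X_lab, y_lab)
    return clf

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 20, axis=0)
y = np.repeat([0, 1], 20)
lab = np.r_[0:5, 20:25]                            # only 10 labeled points
model = self_train(X[lab], y[lab], np.delete(X, lab, axis=0))
print("accuracy on all points:", (model.predict(X) == y).mean())
```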
- Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study
Latent vulnerable functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions.
Despite the noise, we show that the state-of-the-art SV prediction model can significantly benefit from such latent SVs.
arXiv Detail & Related papers (2024-01-20T03:36:01Z)
- Conservative Prediction via Data-Driven Confidence Minimization
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset; a sketch of such an objective follows this entry.
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
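A minimal sketch in the spirit of the framework above: cross-entropy on labeled data plus a penalty on predicted confidence over an "uncertainty" set. The confidence measure (mean max-probability) and the weight `lam` are assumptions, not the paper's exact objective.
```python
# Combine a supervised loss with a confidence penalty on uncertain inputs.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dcm_style_loss(logits_lab, y_lab, logits_unc, lam=0.5):
    p = softmax(logits_lab)
    ce = -np.log(p[np.arange(len(y_lab)), y_lab]).mean()   # supervised term
    conf = softmax(logits_unc).max(axis=1).mean()          # confidence term
    return ce + lam * conf   # penalizing conf keeps the model conservative

logits_lab = np.array([[2.0, -1.0], [-0.5, 1.5]])
logits_unc = np.array([[3.0, -3.0]])   # overconfident on an uncertain input
print(dcm_style_loss(logits_lab, np.array([0, 1]), logits_unc))
```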
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain; the abstention step of selective prediction is sketched after this entry.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
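A minimal sketch of the abstention behavior described above: predict only when confidence clears a threshold, otherwise abstain. The threshold, classifier, and synthetic data are assumptions.
```python
# Selective prediction: trade coverage for accuracy via a confidence cutoff.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2)) + np.repeat([[0.0, 0.0], [2.0, 2.0]], 30, axis=0)
y = np.repeat([0, 1], 30)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)
conf = proba.max(axis=1)
ABSTAIN = -1
preds = np.where(conf >= 0.8, clf.classes_[proba.argmax(axis=1)], ABSTAIN)
covered = preds != ABSTAIN
print(f"coverage={covered.mean():.0%}, "
      f"accuracy on covered={(preds[covered] == y[covered]).mean():.0%}")
```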
- Semi-supervised Variational Autoencoder for Regression: Application on Soft Sensors
We motivate the use of semi-supervised learning by the fact that process quality variables are not collected at the same frequency as other process variables.
These unlabelled records cannot be used to train quality-variable predictors with supervised learning methods.
We extend the approach of supervised VAEs for regression (SVAER) to learn from unlabelled data, leading to semi-supervised VAEs for regression (SSVAER).
arXiv Detail & Related papers (2022-11-11T02:54:59Z)
- On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models
We use large-scale data from 1,782 functions of 429 SVs in 200 real-world projects to develop Machine Learning models for function-level SV assessment tasks.
We show that vulnerable statements are 5.8 times smaller in size, yet exhibit 7.5-114.5% stronger assessment performance.
arXiv Detail & Related papers (2022-03-16T06:29:40Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that manual evaluation can lead to conclusions very different from those of automatic evaluation.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- Improving Limited Labeled Dialogue State Tracking with Self-Supervision
Existing dialogue state tracking (DST) models require plenty of labeled data.
We present and investigate two self-supervised objectives: preserving latent consistency and modeling conversational behavior.
Our proposed self-supervised signals can improve joint goal accuracy by 8.95% when only 1% of the labeled data is used.
arXiv Detail & Related papers (2020-10-26T21:57:42Z)
- Semi-Automatic Data Annotation guided by Feature Space Projection
We present a semi-automatic data annotation approach based on suitable feature space projection and semi-supervised label estimation.
We validate our method on the popular MNIST dataset and on images of human intestinal parasites with and without fecal impurities.
Our results demonstrate the added value of visual analytics tools that combine the complementary abilities of humans and machines for more effective machine learning; a minimal projection-and-propagation sketch follows this entry.
arXiv Detail & Related papers (2020-07-27T17:03:50Z)
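A minimal sketch in the spirit of the approach above: project features to 2-D and propagate a handful of known labels to the rest. t-SNE and LabelSpreading stand in for the paper's projection and label-estimation steps, and scikit-learn's digits set stands in for MNIST (all assumptions).
```python
# Project to 2-D, then spread 50 known labels to the unlabeled points.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)
X2d = TSNE(n_components=2, random_state=0).fit_transform(X)

labels = np.full(len(y), -1)                       # -1 marks "unlabeled"
known = np.random.default_rng(0).choice(len(y), size=50, replace=False)
labels[known] = y[known]                           # 50 human-provided labels

model = LabelSpreading(kernel="knn", n_neighbors=10).fit(X2d, labels)
unl = labels == -1
print(f"estimated-label accuracy: {(model.transduction_[unl] == y[unl]).mean():.0%}")
```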
This list is automatically generated from the titles and abstracts of the papers on this site.