Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
- URL: http://arxiv.org/abs/2407.10722v1
- Date: Mon, 15 Jul 2024 13:47:55 GMT
- Title: Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
- Authors: Triet H. M. Le, M. Ali Babar,
- Abstract summary: We show that mitigating data imbalance can significantly improve the predictive performance of models for all the Common Vulnerability Scoring System (CVSS) tasks.
We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Software Vulnerability (SV) assessment is increasingly adopted to address the ever-increasing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of the Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from the imbalanced distributions of the CVSS classes, but such data imbalance has been hardly understood and addressed in the literature. Aims: We conduct a large-scale study to quantify the impacts of data imbalance and mitigate the issue for SV assessment through the use of data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without leveraging the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient. We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.
Related papers
- SeMi: When Imbalanced Semi-Supervised Learning Meets Mining Hard Examples [54.760757107700755]
Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance.
The class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation.
We propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi)
arXiv Detail & Related papers (2025-01-10T14:35:16Z) - Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT [0.0]
We conduct an empirical study to evaluate the impact of SV data scarcity in emerging languages on the state-of-the-art SV prediction model.
We train and test the state-of-the-art model based on CodeBERT with and without data sampling techniques for function-level and line-level SV prediction.
arXiv Detail & Related papers (2024-04-26T01:57:12Z) - Are Latent Vulnerabilities Hidden Gems for Software Vulnerability
Prediction? An Empirical Study [4.830367174383139]
latent vulnerable functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions.
Despite the noise, we show that the state-of-the-art SV prediction model can significantly benefit from such latent SVs.
arXiv Detail & Related papers (2024-01-20T03:36:01Z) - Robust Survival Analysis with Adversarial Regularization [6.001304967469112]
Survival Analysis (SA) models the time until an event occurs.
Recent work shows that Neural Networks (NNs) can capture complex relationships in SA.
We leverage NN verification advances to create algorithms for robust, fully-parametric survival models.
arXiv Detail & Related papers (2023-12-26T12:18:31Z) - DSV: An Alignment Validation Loss for Self-supervised Outlier Model
Selection [23.253175824487652]
Self-supervised learning (SSL) has proven effective in solving various problems by generating internal supervisory signals.
Unsupervised anomaly detection, which faces the high cost of obtaining true labels, is an area that can greatly benefit from SSL.
We propose DSV (Discordance and Separability Validation), an unsupervised validation loss to select high-performing detection models with effective augmentation HPs.
arXiv Detail & Related papers (2023-07-13T02:45:29Z) - On the Use of Fine-grained Vulnerable Code Statements for Software
Vulnerability Assessment Models [0.0]
We use large-scale data from 1,782 functions of 429 SVs in 200 real-world projects to develop Machine Learning models for function-level SV assessment tasks.
We show that vulnerable statements are 5.8 times smaller in size, yet exhibit 7.5-114.5% stronger assessment performance.
arXiv Detail & Related papers (2022-03-16T06:29:40Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - A Principled Approach to Data Valuation for Federated Learning [73.19984041333599]
Federated learning (FL) is a popular technique to train machine learning (ML) models on decentralized data sources.
The Shapley value (SV) defines a unique payoff scheme that satisfies many desiderata for a data value notion.
This paper proposes a variant of the SV amenable to FL, which we call the federated Shapley value.
arXiv Detail & Related papers (2020-09-14T04:37:54Z) - Provably Efficient Causal Reinforcement Learning with Confounded
Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.