R+R: Security Vulnerability Dataset Quality Is Critical
- URL: http://arxiv.org/abs/2503.06387v1
- Date: Sun, 09 Mar 2025 01:49:30 GMT
- Title: R+R: Security Vulnerability Dataset Quality Is Critical
- Authors: Anurag Swarnim Yadav, Joseph N. Wilson
- Abstract summary: A number of studies have employed datasets that are plagued by high duplication rates, questionable label accuracy, and incomplete samples. Our findings indicate that 56% of the samples had incorrect labels and 44% comprised incomplete samples--only 31% were both accurate and complete. We employ transfer learning using a large deduplicated bugfix corpus to show that these models can exhibit better performance if given larger amounts of high-quality pre-training data.
- Score: 0.6906005491572401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are of great interest in vulnerability detection and repair. The effectiveness of these models hinges on the quality of the datasets used for both training and evaluation. Our investigation reveals that a number of studies featured in prominent software engineering conferences have employed datasets that are plagued by high duplication rates, questionable label accuracy, and incomplete samples. Using these datasets for experimentation will yield incorrect results that are significantly different from actual expected behavior. For example, the state-of-the-art VulRepair Model, which is reported to have 44% accuracy, on average yielded 9% accuracy when test-set duplicates were removed from its training set and 13% accuracy when training-set duplicates were removed from its test set. In an effort to tackle these data quality concerns, we have retrained models from several papers without duplicates and conducted an accuracy assessment of labels for the top ten most hazardous Common Weakness Enumerations (CWEs). Our findings indicate that 56% of the samples had incorrect labels and 44% comprised incomplete samples--only 31% were both accurate and complete. Finally, we employ transfer learning using a large deduplicated bugfix corpus to show that these models can exhibit better performance if given larger amounts of high-quality pre-training data, leading us to conclude that while previous studies have over-estimated performance due to poor dataset quality, this does not demonstrate that better performance is not possible.
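To make the deduplication step concrete, the sketch below shows one way duplicates shared between a training set and a test set could be removed before evaluation, in the spirit of the experiment the abstract describes. The sample layout (dicts with "source"/"target" fields), the whitespace normalization, and all function names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of cross-split deduplication: drop training samples that
# also appear in the test set before fine-tuning and evaluation.
# NOTE: the data layout and normalization are assumptions for illustration.
import hashlib
from typing import Dict, List


def fingerprint(sample: Dict[str, str]) -> str:
    """Hash a sample after collapsing whitespace, so trivially
    reformatted copies are also treated as duplicates."""
    text = " ".join((sample["source"] + " " + sample["target"]).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def remove_cross_split_duplicates(
    train: List[Dict[str, str]], test: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return `train` without any sample whose fingerprint also occurs in
    `test`, so test data no longer leaks into training."""
    test_hashes = {fingerprint(s) for s in test}
    return [s for s in train if fingerprint(s) not in test_hashes]


# Example: deduplicate first, then train and evaluate on the untouched test set.
train_samples = [{"source": "buggy code A", "target": "fixed code A"},
                 {"source": "buggy code B", "target": "fixed code B"}]
test_samples = [{"source": "buggy  code A", "target": "fixed code A"}]
clean_train = remove_cross_split_duplicates(train_samples, test_samples)
print(len(clean_train))  # 1: the near-duplicate of the test sample was removed
```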
Related papers
- Reducing false positives in strong lens detection through effective augmentation and ensemble learning [0.0]
This research studies the impact of high-quality training datasets on the performance of Convolutional Neural Networks (CNNs) in detecting strong gravitational lenses. We stress the importance of data diversity and representativeness, demonstrating how variations in sample populations influence CNN performance.
arXiv Detail & Related papers (2025-02-20T11:50:56Z)
- Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models [0.03724049002462992]
The study analyzed two healthcare datasets: the high-quality MIMIC-III public hospital dataset and a lower-quality private dataset from Australian aged care homes. Mixtral correctly detected errors in 63% of progress notes, with 17% containing a single token misclassified due to medical terminology.
arXiv Detail & Related papers (2025-02-12T00:27:49Z) - Is Training Data Quality or Quantity More Impactful to Small Language Model Performance? [0.0]
This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs).
Training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large.
arXiv Detail & Related papers (2024-11-24T12:51:50Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z) - Learning with Imbalanced Noisy Data by Preventing Bias in Sample
Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z) - Boosting Facial Expression Recognition by A Semi-Supervised Progressive
Teacher [54.50747989860957]
We propose a semi-supervised learning algorithm named Progressive Teacher (PT) to utilize reliable FER datasets as well as large-scale unlabeled expression images for effective training.
Experiments on widely-used databases RAF-DB and FERPlus validate the effectiveness of our method, which achieves state-of-the-art performance with accuracy of 89.57% on RAF-DB.
arXiv Detail & Related papers (2022-05-28T07:47:53Z) - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [5.820381428297218]
TACRED is one of the largest and most widely used sentence-level relation extraction datasets.
Proposed models that are evaluated using this dataset consistently set new state-of-the-art performance.
However, they still exhibit large error rates despite leveraging external knowledge and unsupervised pretraining on large text corpora.
arXiv Detail & Related papers (2021-04-16T22:55:11Z) - Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z) - On the Role of Dataset Quality and Heterogeneity in Model Confidence [27.657631193015252]
Safety-critical applications require machine learning models that output accurate and calibrated probabilities.
Uncalibrated deep networks are known to make over-confident predictions.
We study the impact of dataset quality on model confidence by varying dataset size and label noise.
arXiv Detail & Related papers (2020-02-23T05:13:12Z)