Learning Defect Prediction from Unrealistic Data
- URL: http://arxiv.org/abs/2311.00931v2
- Date: Sat, 20 Jan 2024 17:05:40 GMT
- Title: Learning Defect Prediction from Unrealistic Data
- Authors: Kamel Alrashedy, Vincent J. Hellendoorn, Alessandro Orso
- Abstract summary: Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to perform well only on similar data, while underperforming on real-world programs.
- Score: 57.53586547895278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained models of code, such as CodeBERT and CodeT5, have become popular
choices for code understanding and generation tasks. Such models tend to be
large and require commensurate volumes of training data, which are rarely
available for downstream tasks. Instead, it has become popular to train models
with far larger but less realistic datasets, such as functions with
artificially injected bugs. Models trained on such data, however, tend to only
perform well on similar data, while underperforming on real-world programs. In
this paper, we conjecture that this discrepancy stems from the presence of
distracting samples that steer the model away from the real-world task
distribution. To investigate this conjecture, we propose an approach for
identifying the subsets of these large yet unrealistic datasets that are most
similar to examples in real-world datasets based on their learned
representations. Our approach extracts high-dimensional embeddings of both
real-world and artificial programs using a neural model and scores artificial
samples based on their distance to the nearest real-world sample. We show that
training on only the nearest, representationally most similar samples while
discarding samples that are not at all similar in representations yields
consistent improvements across two popular pretrained models of code on two
code understanding tasks. Our results are promising, in that they show that
training models on a representative subset of an unrealistic dataset can help
us harness the power of large-scale synthetic data generation while preserving
downstream task performance. Finally, we highlight the limitations of applying
AI models for predicting vulnerabilities and bugs in real-world applications.
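To make the filtering step concrete, here is a minimal sketch of this kind of nearest-neighbour selection over precomputed embeddings. It is not the authors' implementation: the function name, the keep_fraction parameter, and the use of scikit-learn's NearestNeighbors are illustrative assumptions; in practice the embeddings would come from a pretrained code model such as CodeBERT.

```python
# Minimal sketch of the distance-based filtering idea: score each synthetic
# sample by its distance to the nearest real-world sample in embedding space
# and keep only the closest fraction. Names and keep_fraction are illustrative
# assumptions, not the authors' implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_synthetic(real_emb: np.ndarray,
                     synthetic_emb: np.ndarray,
                     keep_fraction: float = 0.25) -> np.ndarray:
    """Return indices of the synthetic samples nearest to any real sample."""
    nn = NearestNeighbors(n_neighbors=1).fit(real_emb)
    # Distance from each synthetic sample to its closest real-world neighbour.
    dist, _ = nn.kneighbors(synthetic_emb, n_neighbors=1)
    dist = dist.ravel()
    n_keep = int(len(dist) * keep_fraction)
    # Keep the representationally most similar samples, discard the rest.
    return np.argsort(dist)[:n_keep]

# Example with random stand-ins for embeddings extracted from a pretrained
# code model (e.g., pooled CodeBERT hidden states).
rng = np.random.default_rng(0)
real_emb = rng.normal(size=(1_000, 768))
synthetic_emb = rng.normal(size=(50_000, 768))
keep_idx = filter_synthetic(real_emb, synthetic_emb)
print(f"kept {len(keep_idx)} of {len(synthetic_emb)} synthetic samples")
```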
Related papers
- Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models.
We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
arXiv Detail & Related papers (2024-06-12T21:28:28Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
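As a toy illustration of such a self-consuming loop (a sketch under assumptions, not the paper's setup), the snippet below repeatedly refits a kernel density estimator on a mix of real data and samples drawn from the previous generation's model; the mixing ratio, number of generations, and error metric are all illustrative.

```python
# Toy self-consuming loop with kernel density estimation: each generation is
# refit on a mix of real data and samples from the previous generation's model,
# and we track how far the learned density drifts from the ground truth.
# The mixing ratio, generation count, and error metric are assumptions.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=2000)   # ground-truth data
model = gaussian_kde(real)

grid = np.linspace(-4.0, 4.0, 200)
true_pdf = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)

for generation in range(5):
    synthetic = model.resample(2000, seed=rng).ravel()
    # Mixed-data training: half real samples, half model-generated samples.
    mixed = np.concatenate([real[:1000], synthetic[:1000]])
    model = gaussian_kde(mixed)
    err = np.abs(model(grid) - true_pdf).mean()
    print(f"generation {generation}: mean abs density error = {err:.4f}")
```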
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Efficiently Robustify Pre-trained Models [18.392732966487582]
The robustness of large-scale models in real-world settings is still a less-explored topic.
We first benchmark the performance of these models under different perturbations and datasets.
We then discuss how existing robustification schemes based on complete model fine-tuning may not be a scalable option for very large networks.
arXiv Detail & Related papers (2023-09-14T08:07:49Z)
- Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data slightly underperforms a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z)
- The Big Data Myth: Using Diffusion Models for Dataset Generation to Train Deep Detection Models [0.15469452301122172]
This study presents a framework for the generation of synthetic datasets by fine-tuning stable diffusion models.
The results of this study reveal that the object detection models trained on synthetic data perform similarly to the baseline model.
arXiv Detail & Related papers (2023-06-16T10:48:52Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making predictions on new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification).
We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs.
Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible; a toy sketch of this effect follows after this entry.
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
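As a small, self-contained illustration of the phenomenon described in the last entry (a sketch under assumptions, not the paper's experiments), two training runs of the same tiny single-ReLU model that differ only in random initialization can already disagree noticeably on individual predictions; the architecture, hyperparameters, and data below are all illustrative.

```python
# Toy illustration of irreproducibility: two runs of the same tiny
# one-hidden-layer ReLU regressor, identical except for random initialization,
# can disagree noticeably on individual predictions. All hyperparameters,
# the architecture, and the data are illustrative assumptions.
import numpy as np

def train_and_predict(seed, x, y, x_test):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.5, size=(1, 16))    # input -> hidden weights
    w2 = rng.normal(scale=0.5, size=(16, 1))    # hidden -> output weights
    lr = 0.05
    for _ in range(500):
        h = np.maximum(x @ w1, 0.0)             # single ReLU nonlinearity
        pred = h @ w2
        grad = 2.0 * (pred - y) / len(x)        # d(MSE)/d(pred)
        grad_h = (grad @ w2.T) * (h > 0)        # back-prop through the ReLU
        w2 -= lr * (h.T @ grad)
        w1 -= lr * (x.T @ grad_h)
    return (np.maximum(x_test @ w1, 0.0) @ w2).ravel()

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(256, 1))
y = np.sin(3.0 * x)                             # very simple data
x_test = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)

run_a = train_and_predict(seed=1, x=x, y=y, x_test=x_test)
run_b = train_and_predict(seed=2, x=x, y=y, x_test=x_test)
print("max prediction disagreement between runs:",
      float(np.abs(run_a - run_b).max()))
```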