The Early Bird Catches the Worm: Better Early Life Cycle Defect
Predictors
- URL: http://arxiv.org/abs/2105.11082v1
- Date: Mon, 24 May 2021 03:49:09 GMT
- Title: The Early Bird Catches the Worm: Better Early Life Cycle Defect
Predictors
- Authors: N.C. Shrikanth and Tim Menzies
- Abstract summary: In 240 GitHub projects, we find that the information in that data ``clumps'' towards the earliest parts of the project.
A defect prediction model learned from just the first 150 commits works as well, or better than state-of-the-art alternatives.
- Score: 23.22715542777918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Before researchers rush to reason across all available data, they should
first check if the information is densest within some small region. We say this
since, in 240 GitHub projects, we find that the information in that data
``clumps'' towards the earliest parts of the project. In fact, a defect
prediction model learned from just the first 150 commits works as well, or
better than state-of-the-art alternatives. Using just this early life cycle
data, we can build models very quickly (using weeks, not months, of CPU time).
Also, we can find simple models (with just two features) that generalize to
hundreds of software projects. Based on this experience, we warn that prior
work on generalizing software engineering defect prediction models may have
needlessly complicated an inherently simple process. Further, prior work that
focused on later life cycle data now needs to be revisited, since its
conclusions were drawn from relatively uninformative regions. Replication note:
all our data and scripts are online at
https://github.com/snaraya7/early-defect-prediction-tse.
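The abstract's core claim, that a simple model with just two features trained on only the first 150 commits can predict defects well, can be sketched as follows. The two features (lines added, files touched), the synthetic commit data, and the logistic learner are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: train a defect predictor on only the earliest commits.
# Feature names and the synthetic data are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "commit" data: two simple per-commit features
# (hypothetically, lines added and files touched).
n_commits = 1000
X = rng.poisson(lam=(30, 3), size=(n_commits, 2)).astype(float)
# Synthetic ground truth: larger, wider changes are more defect-prone.
y = (X[:, 0] + 10 * X[:, 1] + rng.normal(0, 10, n_commits) > 70).astype(int)

# Train on just the first 150 commits, as the paper suggests.
EARLY = 150
model = LogisticRegression().fit(X[:EARLY], y[:EARLY])

# Evaluate on the later life cycle data.
acc = model.score(X[EARLY:], y[EARLY:])
print(f"accuracy on later commits: {acc:.2f}")
```

The point of the sketch is the data budget, not the learner: restricting training to the earliest window keeps model building cheap while, per the paper, losing little predictive power.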
Related papers
- How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning [2.3759432635713895]
We attack both pre-trained and fine-tuned code language models to investigate the extent of data extractability.
Fine-tuning requires fewer resources and is increasingly used by both small and large entities for its effectiveness on specialized data.
Data carriers and licensing information are the data most likely to be memorized by pre-trained and fine-tuned models, while the latter is the most likely to be forgotten after fine-tuning.
arXiv Detail & Related papers (2025-01-29T09:17:30Z) - More precise edge detections [0.0]
Edge detection (ED) is a base task in computer vision.
Current models still suffer from unsatisfactory precision rates, and model architectures for more precise predictions still need investigation.
arXiv Detail & Related papers (2024-07-29T13:24:55Z) - DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis of the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data and exits late (using more layers) on noisy data.
arXiv Detail & Related papers (2024-06-08T12:58:13Z) - Learning from Very Little Data: On the Value of Landscape Analysis for
Predicting Software Project Health [13.19204187502255]
This paper only explores the application of niSNEAK to project health. That said, we see nothing in principle that prevents the application of this technique to a wider range of problems.
arXiv Detail & Related papers (2023-01-16T19:27:16Z) - IRJIT: A Simple, Online, Information Retrieval Approach for Just-In-Time Software Defect Prediction [10.084626547964389]
Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time.
Current software defect prediction approaches rely on manually crafted features such as change metrics and involve expensive-to-train machine learning or deep learning models.
We propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits.
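The IRJIT idea of labeling a new commit by its similarity to past labeled commits can be sketched as a plain bag-of-words nearest-neighbor vote. The tokenizer, the choice of k=3, and the toy history below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of an IR-style just-in-time labeler in the spirit of IRJIT:
# label a new commit buggy/clean by its textual similarity to past commits.
import math
from collections import Counter

def tokens(diff: str) -> Counter:
    """Bag-of-words representation of a commit diff or message."""
    return Counter(diff.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def label_commit(new_diff: str, history: list[tuple[str, str]], k: int = 3) -> str:
    """Majority vote among the k past commits most similar to the new one."""
    q = tokens(new_diff)
    ranked = sorted(history, key=lambda h: cosine(q, tokens(h[0])), reverse=True)
    top = [label for _, label in ranked[:k]]
    return max(set(top), key=top.count)

history = [
    ("fix null pointer in parser", "buggy"),
    ("add unit tests for parser", "clean"),
    ("fix off by one in loop", "buggy"),
    ("update readme docs", "clean"),
]
print(label_commit("fix crash null pointer", history))  # prints 'buggy' for this toy history
```

Because there is no model to train, such an approach stays online: each newly labeled commit is simply appended to the history.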
arXiv Detail & Related papers (2022-10-05T17:54:53Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - On Anytime Learning at Macroscale [33.674452784463774]
In many practical applications, data does not arrive all at once, but in batches over time.
A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as they become available, but it may also make sub-optimal use of future data.
A tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver a much better performance.
arXiv Detail & Related papers (2021-06-17T14:45:22Z) - Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z) - Injecting Knowledge in Data-driven Vehicle Trajectory Predictors [82.91398970736391]
Vehicle trajectory prediction tasks have been commonly tackled from two perspectives: knowledge-driven or data-driven.
In this paper, we propose to learn a "Realistic Residual Block" (RRB) which effectively connects these two perspectives.
Our proposed method outputs realistic predictions by confining the residual range and taking into account its uncertainty.
arXiv Detail & Related papers (2021-03-08T16:03:09Z) - Early Life Cycle Software Defect Prediction. Why? How? [37.48549087467758]
We analyzed hundreds of popular GitHub projects for 84 months.
Across these projects, most of the defects occur very early in their life cycle.
We hope these results inspire other researchers to adopt a "simplicity-first" approach to their work.
arXiv Detail & Related papers (2020-11-26T00:13:52Z) - Variational Bayesian Unlearning [54.26984662139516]
We study the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased.
We show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief.
In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging.
arXiv Detail & Related papers (2020-10-24T11:53:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.