DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt
Identification using Semi-Supervised Learning
- URL: http://arxiv.org/abs/2201.10592v1
- Date: Tue, 25 Jan 2022 19:21:24 GMT
- Title: DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt
Identification using Semi-Supervised Learning
- Authors: Huy Tu and Tim Menzies
- Abstract summary: DebtFree is a two-mode framework based on unsupervised learning for identifying SATDs.
Our experiments on 10 software projects show that both modes yield a statistically significant improvement over state-of-the-art automated and semi-automated models.
- Score: 31.13621632964345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keeping track of and managing Self-Admitted Technical Debts (SATDs) is
important for maintaining a healthy software project. The current active-learning
SATD recognition tool requires manual inspection of 24% of the test comments on
average to reach 90% recall. Since only about 5% of the test comments are SATDs,
human experts must read nearly five times as many comments as there are actual
SATDs, which indicates the inefficiency of the tool. Moreover, human experts are
still error-prone: 95% of the false-positive labels from previous work were
actually true positives.
To solve the above problems, we propose DebtFree, a two-mode framework based
on unsupervised learning for identifying SATDs. In mode1, where the existing
training data is unlabeled, DebtFree starts with an unsupervised learner that
automatically pseudo-labels the programming comments in the training data. In
contrast, in mode2, where labels are available for the corresponding training
data, DebtFree starts with a pre-processor that identifies the comments in the
test dataset most likely to be SATDs. Then, our machine learning model is
employed to assist human experts in manually identifying the remaining SATDs.
Our experiments on 10 software projects show that both modes yield a
statistically significant improvement in effectiveness over the state-of-the-art
automated and semi-automated models. Specifically, DebtFree can reduce the
labeling effort by 99% in mode1 (unlabeled training data) and by up to 63% in
mode2 (labeled training data), while improving the current active learner's F1
score by almost 100% in relative terms.
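To make the two modes concrete, below is a minimal sketch of how such a pipeline could be wired together. The TF-IDF features, the choice of KMeans as the unsupervised pseudo-labeler, and logistic regression as the model that ranks comments for human review are illustrative assumptions, not the components used in the paper.

```python
# Illustrative sketch of a DebtFree-style two-mode pipeline (assumptions:
# TF-IDF features, KMeans as the unsupervised pseudo-labeler, and logistic
# regression as the ranker; the paper's actual components may differ).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def pseudo_label(train_comments):
    """Mode 1: no labels available -- cluster the comments and treat the
    smaller cluster as the (rare) SATD class."""
    vec = TfidfVectorizer(max_features=2000)
    X = vec.fit_transform(train_comments)
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # SATDs are rare (~5% of comments), so assume the minority cluster is SATD.
    satd_cluster = np.argmin(np.bincount(clusters))
    return vec, X, (clusters == satd_cluster).astype(int)


def rank_for_review(vec, X_train, y_train, test_comments):
    """Mode 2-style step: train a classifier and rank test comments so that
    human experts read the most SATD-prone comments first."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    X_test = vec.transform(test_comments)
    scores = clf.predict_proba(X_test)[:, 1]
    return np.argsort(-scores)  # indices, most SATD-prone first


if __name__ == "__main__":
    train = ["TODO: fix this ugly hack later", "compute the total price",
             "FIXME temporary workaround", "return the parsed config"]
    test = ["hack: remove once the API is stable", "open the output file"]
    vec, X, pseudo = pseudo_label(train)           # mode 1: pseudo-labels
    order = rank_for_review(vec, X, pseudo, test)  # experts review in this order
    print([test[i] for i in order])
```

In this sketch, mode1 substitutes the pseudo-labels for human labels entirely, while mode2 would feed real labels into the same ranking step so that experts inspect the most SATD-prone comments first and stop once the recall target is reached.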
Related papers
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- One-bit Supervision for Image Classification: Problem, Solution, and Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that of full-bit semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z)
- Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We? [17.128428286986573]
This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models.
We start by extracting a dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects.
We use this dataset to experiment with seven different generative deep learning (DL) model configurations.
arXiv Detail & Related papers (2023-08-17T12:27:32Z)
- When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors [15.862838836160634]
This paper applies a wide range of 55 semi-supervised learners to over 714 projects.
We find that semi-supervised "co-training methods" work significantly better than other approaches.
arXiv Detail & Related papers (2022-11-10T23:39:12Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Identifying Self-Admitted Technical Debt in Issue Tracking Systems using Machine Learning [3.446864074238136]
Technical debt is a metaphor for sub-optimal solutions implemented for short-term benefits.
Most work on identifying Self-Admitted Technical Debt focuses on source code comments.
We propose and optimize an approach for automatically identifying SATD in issue tracking systems using machine learning.
arXiv Detail & Related papers (2022-02-04T15:15:13Z)
- Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models (a minimal sketch of the underlying self-training loop appears after this list).
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- Don't Wait, Just Weight: Improving Unsupervised Representations by Learning Goal-Driven Instance Weights [92.16372657233394]
Self-supervised learning techniques can boost performance by learning useful representations from unlabelled data.
We show that by learning Bayesian instance weights for the unlabelled data, we can improve the downstream classification accuracy.
Our method, BetaDataWeighter is evaluated using the popular self-supervised rotation prediction task on STL-10 and Visual Decathlon.
arXiv Detail & Related papers (2020-06-22T15:59:32Z)
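Several of the related papers above (Incremental Self-training, Debiased Pseudo Labeling, and the uncertainty-aware self-training work) build on the same basic self-training loop. The sketch below shows that loop with a plain confidence threshold standing in for the more elaborate selection, debiasing, and uncertainty-estimation strategies those papers propose; the 0.9 threshold and the logistic-regression base learner are assumptions made purely for illustration.

```python
# Minimal self-training loop: train on labeled data, pseudo-label the most
# confident unlabeled examples, add them to the training set, and repeat.
# The 0.9 confidence threshold and the logistic-regression model are
# illustrative stand-ins for the papers' more sophisticated criteria.
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(X_lab, y_lab, X_unlab, rounds=3, threshold=0.9):
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold   # keep only confident pseudo-labels
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]                # remove what was just pseudo-labeled
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    return clf


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(20, 5))
    y_lab = (X_lab[:, 0] > 0).astype(int)            # toy labels on a toy task
    X_unlab = rng.normal(size=(200, 5))
    model = self_train(X_lab, y_lab, X_unlab)
    print(model.predict(rng.normal(size=(3, 5))))
```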