Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
- URL: http://arxiv.org/abs/2308.08943v1
- Date: Thu, 17 Aug 2023 12:27:32 GMT
- Title: Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
- Authors: Antonio Mastropaolo, Massimiliano Di Penta, Gabriele Bavota
- Abstract summary: This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models.
We start by extracting a dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects.
We use this dataset to experiment with seven different generative deep learning (DL) model configurations.
- Score: 17.128428286986573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Upon evolving their software, organizations and individual developers have to
spend substantial effort to pay back technical debt, i.e., the consequences of
releasing software in a shape not as good as it should be, e.g., in terms of
functionality, reliability, or maintainability. This paper empirically
investigates the extent to which technical debt can be automatically paid back
by neural-based generative models, and in particular models exploiting
different strategies for pre-training and fine-tuning. We start by extracting a
dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595
open-source projects. SATD refers to technical debt instances documented (e.g.,
via code comments) by developers. We use this dataset to experiment with seven
different generative deep learning (DL) model configurations. Specifically, we
compare transformers pre-trained and fine-tuned with different combinations of
training objectives, including the fixing of generic code changes, SATD
removals, and SATD-comment prompt tuning. Also, we investigate the
applicability in this context of a recently-available Large Language Model
(LLM)-based chatbot. Results of our study indicate that the automated
repayment of SATD is a challenging task, with the best model we experimented
with able to automatically fix ~2% to 8% of test instances, depending on the
number of attempts it is allowed to make. Given the limited size of the
fine-tuning dataset (~5k instances), the model's pre-training plays a
fundamental role in boosting performance. Also, the ability to remove SATD
steadily drops if the comment documenting the SATD is not provided as input to
the model. Finally, we found that general-purpose LLMs are not a competitive
approach for addressing SATD.
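As a rough illustration of the setup the abstract describes, the sketch below feeds a SATD instance (the method to fix plus its SATD comment) to a pre-trained code-to-code model and samples several candidate fixes, mirroring the "number of attempts" dimension of the evaluation. The checkpoint name, input encoding, and attempt count are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: generating candidate SATD fixes with a pre-trained
# seq2seq code model (checkpoint and input format are assumptions, not the
# paper's exact setup).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5-base"  # assumed stand-in for the paper's models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# A SATD instance: the developer's SATD comment plus the method to fix,
# concatenated into a single prompt (one plausible encoding among many).
satd_comment = "// TODO: this lookup is O(n); replace with a map"
method_body = "String find(List<Pair> l, String k) { for (Pair p : l) if (p.key.equals(k)) return p.value; return null; }"
inputs = tokenizer(satd_comment + "\n" + method_body, return_tensors="pt", truncation=True)

# Multiple attempts: beam search returning k candidate fixes.
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5, max_new_tokens=256)
for i, candidate in enumerate(tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"--- attempt {i + 1} ---\n{candidate}")
```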
Related papers
- A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems [13.90991624629898]
This paper empirically analyzes the presence of Self-Admitted Technical Debt (SATD) in Deep Learning systems.
We derived a taxonomy of DL-specific SATD through open coding, featuring seven categories and 41 leaves.
arXiv Detail & Related papers (2024-09-18T09:21:10Z)
- SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt [6.699060157800401]
Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts.
We share the SATDAUG dataset, an augmented version of existing SATD datasets spanning source code comments, issue trackers, pull requests, and commit messages.
arXiv Detail & Related papers (2024-03-12T14:33:53Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
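One way to read the retrieval-augmentation idea above is as prompt construction: fetch documents for the query, prepend them as context, and let the frozen pre-trained model condition on them. The sketch below illustrates that pattern only; `search()` is a hypothetical placeholder, not an API from the paper.

```python
# Minimal sketch of search-engine augmentation: retrieved snippets are
# prepended to the query before the frozen model sees it. `search` is a
# hypothetical stand-in for a real search-engine client.
from typing import List

def search(query: str, k: int = 3) -> List[str]:
    # Placeholder: a real implementation would call a search-engine API
    # and return the top-k snippets for the query.
    return [f"(snippet {i} relevant to: {query})" for i in range(k)]

def build_augmented_prompt(query: str) -> str:
    snippets = search(query)
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_augmented_prompt("What is self-admitted technical debt?"))
```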
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Automatically Estimating the Effort Required to Repay Self-Admitted Technical Debt [1.8208834479445897]
Self-Admitted Technical Debt (SATD) is a specific form of technical debt documented by developers within software artifacts.
We propose a novel approach for automatically estimating SATD repayment effort, utilizing a comprehensive dataset.
Our findings show that different types of SATD require varying levels of repayment effort, with code/design, requirement, and test debt demanding greater effort compared to non-SATD items.
arXiv Detail & Related papers (2023-09-12T07:40:18Z)
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
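The relevance-aware contrastive objective above is a variant of the standard InfoNCE loss used to pre-train dense retrievers such as Contriever. The sketch below shows plain in-batch InfoNCE, not the paper's relevance-aware weighting, purely to fix ideas.

```python
# Plain in-batch InfoNCE over (query, positive passage) embedding pairs; the
# paper's relevance-aware variant reweights these terms, which is omitted here.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """q, p: [batch, dim] L2-normalized embeddings; p[i] is the positive for
    q[i], and every other row in the batch serves as a negative."""
    logits = q @ p.t() / temperature   # [batch, batch] similarity matrix
    labels = torch.arange(q.size(0))   # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(8, 128), dim=-1)
p = F.normalize(torch.randn(8, 128), dim=-1)
print(info_nce(q, p).item())
```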
- Identifying Self-Admitted Technical Debt in Issue Tracking Systems using Machine Learning [3.446864074238136]
Technical debt is a metaphor for sub-optimal solutions implemented for short-term benefits.
Most work on identifying Self-Admitted Technical Debt focuses on source code comments.
We propose and optimize an approach for automatically identifying SATD in issue tracking systems using machine learning.
arXiv Detail & Related papers (2022-02-04T15:15:13Z)
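A minimal baseline for the classification task described above is a bag-of-words classifier over issue text. The sketch below uses TF-IDF plus logistic regression on toy examples, deliberately far simpler than the optimized approach in the paper.

```python
# Toy SATD detector for issue-tracker text: TF-IDF features + logistic
# regression. A simple baseline, not the paper's optimized model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

issues = [
    "TODO: quick workaround until the scheduler is rewritten",  # SATD
    "Hack: hardcoded timeout, clean this up later",             # SATD
    "Add support for exporting reports as CSV",                 # not SATD
    "Update installation instructions in the README",           # not SATD
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(issues, labels)
print(clf.predict(["temporary fix, refactor when we have time"]))  # likely SATD
```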
- DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning [31.13621632964345]
DebtFree is a two-mode framework based on unsupervised learning for identifying SATDs.
Our experiments on 10 software projects show that both models yield a statistically significant improvement over the state-of-the-art automated and semi-automated models.
arXiv Detail & Related papers (2022-01-25T19:21:24Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective data augmentation (DA) method based on a noise generator, which learns to perturb the word embeddings of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our perturbed word embeddings on a single source dataset, evaluating on five different target domains.
Notably, the model trained with our method outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
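The data-augmentation idea above amounts to injecting small, learned noise at the embedding layer so training sees semantically equivalent variants of each input. The sketch below adds input-conditioned Gaussian noise to embeddings; it captures the general mechanism only, not the paper's specific noise generator.

```python
# Sketch of embedding-level augmentation: a small learned module predicts a
# per-position noise scale that perturbs word embeddings during training.
# This mirrors the general idea; the paper's generator differs in detail.
import torch
import torch.nn as nn

class NoisyEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.noise_scale = nn.Linear(dim, dim)  # learned, input-conditioned scale

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        e = self.embed(token_ids)                       # [batch, seq, dim]
        if self.training:
            scale = torch.sigmoid(self.noise_scale(e))  # keep perturbation small
            e = e + scale * torch.randn_like(e)
        return e

layer = NoisyEmbedding(vocab_size=30522, dim=64)
layer.train()
print(layer(torch.randint(0, 30522, (2, 7))).shape)  # torch.Size([2, 7, 64])
```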
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
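The early-exit modification above attaches a classifier to intermediate layers and stops computing once a prediction is confident enough. The sketch below shows the inference loop over a toy stack of layers; the architecture, pooling, and threshold are illustrative choices, not the paper's calibrated setup.

```python
# Conceptual early-exit inference: per-layer classification heads let easy
# inputs leave the network early; layers, heads, and threshold are toy choices.
import torch
import torch.nn as nn

dim, num_classes, depth = 64, 3, 6
layers = nn.ModuleList([nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)])
heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(depth)])
layers.eval()
heads.eval()

def early_exit_predict(x: torch.Tensor, threshold: float = 0.9):
    h = x
    for i, (layer, head) in enumerate(zip(layers, heads)):
        h = layer(h)
        probs = torch.softmax(head(h.mean(dim=1)), dim=-1)  # pooled prediction
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:                        # confident: stop here
            return pred.item(), i + 1
    return pred.item(), depth                               # used every layer

x = torch.randn(1, 10, dim)  # one toy sequence of 10 token embeddings
label, layers_used = early_exit_predict(x)
print(f"predicted class {label} after {layers_used} layer(s)")
```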