Improving the detection of technical debt in Java source code with an enriched dataset
- URL: http://arxiv.org/abs/2411.05457v1
- Date: Fri, 08 Nov 2024 10:12:33 GMT
- Title: Improving the detection of technical debt in Java source code with an enriched dataset
- Authors: Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman
- Abstract summary: Technical debt (TD) is the additional work and costs that emerge when developers opt for a quick and easy solution to a problem.
Recent research has focused on detecting Self-Admitted Technical Debts (SATDs) by analyzing comments embedded in source code.
We curated the first ever dataset of TD identified by code comments, coupled with its associated source code.
- Score: 12.07607688189035
- Abstract: Technical debt (TD) describes the additional work and costs that emerge when developers opt for a quick and easy solution to a problem rather than a more effective and well-designed, but time-consuming, approach. Self-Admitted Technical Debts (SATDs) are a specific type of technical debt that developers intentionally document and acknowledge, typically via textual comments. While these self-admitted comments are a useful tool for identifying technical debt, most existing approaches focus on capturing crucial tokens associated with various categories of TD, neglecting the rich information embedded within the source code itself. Recent research has focused on detecting SATDs by analyzing comments embedded in source code, and there has been little work dealing with technical debt contained in the source code. To fill this gap, through the analysis of comments and their associated source code from 974 Java projects hosted in the Stack corpus, we curated the first ever dataset of TD identified by code comments, coupled with its associated source code. Through an empirical evaluation, we found that the comments of the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy in predicting various types of technical debt. In this respect, our contribution is twofold: (i) we believe that our dataset will catalyze future work in the domain, inspiring various research issues related to the recognition of technical debt; (ii) the proposed classifiers may serve as baselines for other studies on the detection of TD by means of the curated dataset.
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code [2.399010142304227]
MADE-WIC is a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses.
It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects.
arXiv Detail & Related papers (2024-08-09T16:32:38Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - Systematic literature review on forecasting and prediction of technical debt evolution [0.0]
Technical debt (TD) refers to the additional costs incurred due to compromises in software quality.
This study aims to explore existing knowledge in software engineering to gain insights into approaches proposed in research and industry.
arXiv Detail & Related papers (2024-06-17T18:50:37Z) - A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning [51.7818820745221]
Underwater image enhancement (UIE) presents a significant challenge within computer vision research.
Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent.
arXiv Detail & Related papers (2024-05-30T04:46:40Z) - SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt [6.699060157800401]
Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts.
We share the SATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue trackers, pull requests, and commit messages.
arXiv Detail & Related papers (2024-03-12T14:33:53Z) - What Can Self-Admitted Technical Debt Tell Us About Security? A Mixed-Methods Study [6.286506087629511]
Self-Admitted Technical Debt (SATD) can be deemed a dreadful source of information on potentially exploitable vulnerabilities and security flaws.
This work investigates the security implications of SATD from a technical and developer-centred perspective.
arXiv Detail & Related papers (2024-01-23T13:48:49Z) - Utilization of machine learning for the detection of self-admitted vulnerabilities [0.0]
Technical debt is a metaphor that describes not-quite-right code introduced for short-term needs.
Developers are aware of it and admit it in source code comments, which is called Self-Admitted Technical Debt (SATD).
arXiv Detail & Related papers (2023-09-27T12:38:12Z) - SF-FSDA: Source-Free Few-Shot Domain Adaptive Object Detection with Efficient Labeled Data Factory [94.11898696478683]
Domain adaptive object detection aims to leverage the knowledge learned from a labeled source domain to improve the performance on an unlabeled target domain.
We propose and investigate a more practical and challenging domain adaptive object detection problem under both source-free and few-shot conditions, named SF-FSDA.
arXiv Detail & Related papers (2023-06-07T12:34:55Z) - A Continual Deepfake Detection Benchmark: Dataset, Methods, and Essentials [97.69553832500547]
This paper suggests a continual deepfake detection benchmark (CDDB) over a new collection of deepfakes from both known and unknown generative models.
We exploit multiple approaches to adapt multiclass incremental learning methods, commonly used in continual visual recognition, to the continual deepfake detection problem.
arXiv Detail & Related papers (2022-05-11T13:07:19Z) - A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)