Related papers: Improving the detection of technical debt in Java source code with an enriched dataset

URL: http://arxiv.org/abs/2411.05457v1
Date: Fri, 08 Nov 2024 10:12:33 GMT
Title: Improving the detection of technical debt in Java source code with an enriched dataset
Authors: Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman
Abstract summary: Technical debt (TD) is the additional work and costs that emerge when developers opt for a quick and easy solution to a problem. Recent research has focused on detecting Self-Admitted Technical Debts (SATDs) by analyzing comments embedded in source code. We curated the first ever dataset of TD identified by code comments, coupled with its associated source code.
Score: 12.07607688189035
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Technical debt (TD) is a term used to describe the additional work and costs that emerge when developers have opted for a quick and easy solution to a problem, rather than a more effective and well-designed, but time-consuming approach. Self-Admitted Technical Debts (SATDs) are a specific type of technical debts that developers intentionally document and acknowledge, typically via textual comments. While these self-admitted comments are a useful tool for identifying technical debts, most of the existing approaches focus on capturing crucial tokens associated with various categories of TD, neglecting the rich information embedded within the source code itself. Recent research has focused on detecting SATDs by analyzing comments embedded in source code, and there has been little work dealing with technical debts contained in the source code. To fill such a gap, in this study, through the analysis of comments and their associated source code from 974 Java projects hosted in the Stack corpus, we curated the first ever dataset of TD identified by code comments, coupled with its associated source code. Through an empirical evaluation, we found out that the comments of the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy in predicting various types of technical debt. In this respect, our work is two-fold: (i) We believe that our dataset will catalyze future work in the domain, inspiring various research issues related to the recognition of technical debt; (ii) The proposed classifiers may serve as baselines for other studies on the detection of TD by means of the curated dataset.

Related papers

DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective [59.66984417026933]
We introduce a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). We formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. Our benchmark, DATABench, comprises 17 evasion attacks, 5 forgery attacks, and 9
arXiv Detail & Related papers (2025-07-08T03:07:15Z)
Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD) [4.114847619719728]
Self-Admitted Technical Debt (SATD) is a sub-type of technical debt (TD). Previous research on SATD has focused predominantly on the Java programming language. We introduce CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts.
arXiv Detail & Related papers (2025-05-02T09:25:41Z)
An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
Knowledge Graph Completion with Relation-Aware Anchor Enhancement [50.50944396454757]
We propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). We first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, it is tuned to be more discriminative for target entity matching.
arXiv Detail & Related papers (2025-04-08T15:22:08Z)
Leveraging multi-task learning to improve the detection of SATD and vulnerability [2.5385600700122737]
Self-Admitted Technical Debt (SATD) refers to comments in the code that indicate not-quite-right code introduced for short-term needs. VulSATD is a deep learner that detects vulnerable and SATD code based on CodeBERT.
arXiv Detail & Related papers (2025-01-27T10:31:07Z)
GeAR: Generation Augmented Retrieval [82.20696567697016]
Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. We propose a new method called Generation Augmented Retrieval (GeAR) that incorporates well-designed fusion and decoding modules.
arXiv Detail & Related papers (2025-01-06T05:29:00Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code [2.399010142304227]
MADE-WIC is a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses. It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects.
arXiv Detail & Related papers (2024-08-09T16:32:38Z)
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
Systematic literature review on forecasting and prediction of technical debt evolution [0.0]
Technical debt (TD) refers to the additional costs incurred due to compromises in software quality. This study aims to explore existing knowledge in software engineering to gain insights into approaches proposed in research and industry.
arXiv Detail & Related papers (2024-06-17T18:50:37Z)
A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning [51.7818820745221]
Underwater image enhancement (UIE) presents a significant challenge within computer vision research. Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent.
arXiv Detail & Related papers (2024-05-30T04:46:40Z)
SATDAUG -- A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt [6.699060157800401]
Self-admitted technical debt (SATD) refers to a form of technical debt in which developers explicitly acknowledge and document the existence of technical shortcuts. We share the SATDAUG dataset, an augmented version of existing SATD datasets, including source code comments, issue tracker, pull requests, and commit messages.
arXiv Detail & Related papers (2024-03-12T14:33:53Z)
What Can Self-Admitted Technical Debt Tell Us About Security? A Mixed-Methods Study [6.286506087629511]
Self-Admitted Technical Debt (SATD) can be deemed as dreadful sources of information on potentially exploitable vulnerabilities and security flaws. This work investigates the security implications of SATD from a technical and developer-centred perspective.
arXiv Detail & Related papers (2024-01-23T13:48:49Z)
Utilization of machine learning for the detection of self-admitted vulnerabilities [0.0]
Technical debt is a metaphor that describes not-quite-right code introduced for short-term needs. Developers are aware of it and admit it in source code comments, which is called Self-Admitted Technical Debt (SATD).
arXiv Detail & Related papers (2023-09-27T12:38:12Z)
SF-FSDA: Source-Free Few-Shot Domain Adaptive Object Detection with Efficient Labeled Data Factory [94.11898696478683]
Domain adaptive object detection aims to leverage the knowledge learned from a labeled source domain to improve the performance on an unlabeled target domain. We propose and investigate a more practical and challenging domain adaptive object detection problem under both source-free and few-shot conditions, named as SF-FSDA.
arXiv Detail & Related papers (2023-06-07T12:34:55Z)
A Continual Deepfake Detection Benchmark: Dataset, Methods, and Essentials [97.69553832500547]
This paper suggests a continual deepfake detection benchmark (CDDB) over a new collection of deepfakes from both known and unknown generative models. We exploit multiple approaches to adapt multiclass incremental learning methods, commonly used in the continual visual recognition, to the continual deepfake detection problem.
arXiv Detail & Related papers (2022-05-11T13:07:19Z)
A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens. We show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.