Related papers: Self-Admitted Technical Debt in LLM Software: An Empirical Comparison with ML and Non-ML Software

URL: http://arxiv.org/abs/2601.06266v2
Date: Tue, 13 Jan 2026 02:51:00 GMT
Title: Self-Admitted Technical Debt in LLM Software: An Empirical Comparison with ML and Non-ML Software
Authors: Niruthiha Selvanayagam, Taher A. Ghaleb, Manel Abdellatif
Abstract summary: Self-admitted technical debt (SATD) refers to comments flagged by developers that explicitly acknowledge suboptimal code or incomplete functionality. We conduct the first empirical study of SATD in the Large Language Model era.
Score: 0.8156494881838944
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-admitted technical debt (SATD), referring to comments flagged by developers that explicitly acknowledge suboptimal code or incomplete functionality, has received extensive attention in machine learning (ML) and traditional (Non-ML) software. However, little is known about how SATD manifests and evolves in contemporary Large Language Model (LLM)-based systems, whose architectures, workflows, and dependencies differ fundamentally from both traditional and pre-LLM ML software. In this paper, we conduct the first empirical study of SATD in the LLM era, replicating and extending prior work on ML technical debt to modern LLM-based systems. We compare SATD prevalence across LLM, ML, and non-ML repositories across a total of 477 repositories (159 per category). We perform survival analysis of SATD introduction and removal to understand the dynamics of technical debt across different development paradigms. Surprisingly, despite their architectural complexity, our results reveal that LLM repositories accumulate SATD at similar rates to ML systems (3.95% vs. 4.10%). However, we observe that LLM repositories remain debt-free 2.4x longer than ML repositories (a median of 492 days vs. 204 days), and then start to accumulate technical debt rapidly. Moreover, our qualitative analysis of 377 SATD instances reveals three new forms of technical debt unique to LLM-based development that have not been reported in prior research: Model-Stack Workaround Debt, Model Dependency Debt, and Performance Optimization Debt. Finally, by mapping SATD to stages of the LLM development pipeline, we observe that debt concentrates

Related papers

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints [100.02131897927484]
This paper focuses on the native training of Multimodal Large Language Models (MLLMs) in an end-to-end manner. We propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs.
arXiv Detail & Related papers (2025-10-09T17:59:37Z)
PromptDebt: A Comprehensive Study of Technical Debt Across LLM Projects [0.0]
Large Language Models (LLMs) are increasingly embedded in software via OpenAI, offering powerful AI features without heavy infrastructure. Yet these integrations bring their own form of self-admitted technical debt (SATD). In this paper, we present the first large-scale empirical study of SATD: its origins, prevalence, and mitigation strategies.
arXiv Detail & Related papers (2025-09-24T19:20:09Z)
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection [15.026084450436976]
We present a study evaluating the performance of large language models (LLMs) on the software vulnerability detection task. We have compiled a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools.
arXiv Detail & Related papers (2025-03-03T11:56:00Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs [42.72336063802124]
Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. Many deficiencies of MLLMs have been found compared to human intelligence, e.g., hallucination. We propose benchmarking an essential but usually overlooked intelligence: association, a human's basic capability to link observation and prior practice memory.
arXiv Detail & Related papers (2024-10-02T10:58:54Z)
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future [15.568939568441317]
We investigate the current practice and solutions for large language models (LLMs) and LLM-based agents for software engineering. In particular, we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. We discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering.
arXiv Detail & Related papers (2024-08-05T14:01:15Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software [17.999512016809945]
Self-admitted technical debt (SATD) can have a significant impact on the quality of machine learning-based software. This paper aims to investigate SATD in ML code by analyzing 318 open-source ML projects across five domains, along with 318 non-ML projects.
arXiv Detail & Related papers (2023-11-20T18:56:36Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
A Survey on Multimodal Large Language Models [71.63375558033364]
The Multimodal Large Language Model (MLLM), represented by GPT-4V, has become a rising research hotspot. This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.