Related papers: Investigating The Smells of LLM Generated Code

URL: http://arxiv.org/abs/2510.03029v1
Date: Fri, 03 Oct 2025 14:09:55 GMT
Title: Investigating The Smells of LLM Generated Code
Authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley
Abstract summary: Large Language Models (LLMs) are increasingly being used to generate program code. This study proposes a scenario-based method of evaluating the quality of LLM-generated code.
Score: 2.9232837969697965
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Context: Large Language Models (LLMs) are increasingly being used to generate program code. Much research has been reported on the functional correctness of generated code, but there is far less on code quality. Objectives: In this study, we propose a scenario-based method of evaluating the quality of LLM-generated code to identify the weakest scenarios in which the quality of LLM-generated code should be improved. Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline formed from reference solutions of professionally written code. The test dataset is divided into various subsets according to the topics of the code and complexity of the coding tasks to represent different scenarios of using LLMs for code generation. We also present an automated test system for this purpose and report experiments with the Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon. Results: We find that LLM-generated code has a higher incidence of code smells compared to reference solutions. Falcon performed the least badly, with a smell increase of 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%) and finally Codex (84.97%). The average smell increase across all LLMs was 63.34%, comprising 73.35% for implementation smells and 21.42% for design smells. We also found that the increase in code smells is greater for more complex coding tasks and for more advanced topics, such as those involving object-orientated concepts. Conclusion: In terms of code smells, LLMs' performance across coding task complexities and topics is highly correlated with the quality of human-written code in the corresponding scenarios. However, the quality of LLM-generated code is noticeably poorer than that of human-written code.

Related papers

Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection [0.5249836059995157]
Code smells are symptoms of potential code quality problems that may affect software maintainability. This paper evaluates the effectiveness of four large language models (LLMs) for detecting nine code smells across 30 Java projects.
arXiv Detail & Related papers (2026-01-14T21:08:35Z)
CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions. We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z)
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution [68.95247403447051]
We introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences.
arXiv Detail & Related papers (2025-10-09T18:01:47Z)
Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset [13.23492570818459]
This study conducts the first systematic research to assess and improve dataset quality in terms of code smells. We propose an LLM-based code smell cleaning tool, named SmellCC, which automatically removes code smells.
arXiv Detail & Related papers (2025-08-16T07:40:58Z)
Is LLM-Generated Code More Maintainable & Reliable than Human-Written Code? [4.893345190925178]
This study compares the internal quality attributes of LLM-generated and human-written code. Our analysis shows that LLM-generated code has fewer bugs and requires less effort to fix them overall.
arXiv Detail & Related papers (2025-08-01T15:17:34Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent [2.8391355909797644]
Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. There is still a gap between LLMs being capable coders and being top-tier software engineers.
arXiv Detail & Related papers (2024-05-31T22:06:18Z)
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale freeform question-answering (QA) benchmark for code. It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages. We conduct a systematic evaluation of over 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code. We find that code prompting exhibits a high-performance boost for multiple LLMs. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
