Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset
- URL: http://arxiv.org/abs/2508.11958v1
- Date: Sat, 16 Aug 2025 07:40:58 GMT
- Title: Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset
- Authors: Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, Shanping Li
- Abstract summary: This study conducts the first systematic research to assess and improve dataset quality in terms of code smells. We propose an LLM-based code smell cleaning tool, named SmellCC, which automatically removes code smells.
- Score: 13.23492570818459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to LLM inputs (e.g., the quality of training code). Given that code smells are widespread in practice and can negatively impact software maintainability and readability, this study conducts the first systematic research to assess and improve dataset quality in terms of code smells. In this work, we first conduct a preliminary study to explore the presence of code smells in a popular benchmark dataset (i.e., CodeSearchNet-Python) and evaluate the output of several popular LLMs (i.e., DeepSeek-Coder, CodeLlama, and MagiCoder), revealing that code smell issues are pervasive in both LLM inputs (e.g., benchmark datasets) and outputs (e.g., generated code). We then conduct our systematic research in three main steps. First, we propose an LLM-based code smell cleaning tool, named SmellCC, which automatically refactors code to remove smells; to evaluate the correctness of the refactoring, we construct a test set of 50 repositories sourced from the CodeSearchNet-Python benchmark for functional testing. Second, we apply our curated smell-cleaned dataset to fine-tune two LLMs (i.e., DeepSeek-V2 and Qwen-Coder) to explore their potential for generating high-quality code. Third, we investigate the impact of code smells on two downstream tasks: code completion and code search. Finally, we derive several actionable implications for software engineering researchers and industry practitioners from our findings.
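To make the cleaning step concrete, below is a minimal, self-contained sketch of the kind of transformation a smell-cleaning tool like SmellCC performs, paired with the behavior check the abstract implies (functional testing before accepting a refactoring). The smelly/clean function pair and the `behavior_preserved` harness are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical illustration: the kind of refactoring a smell-cleaning
# tool performs, plus a functional check that behavior is preserved.
# None of these names come from the paper; they are assumptions.

def price_smelly(qty, p, f):
    # Smells: cryptic names, magic numbers, duplicated logic.
    if f == 1:
        return qty * p * 0.9 + 5.0
    if f == 2:
        return qty * p * 0.8 + 5.0
    return qty * p + 5.0

DISCOUNTS = {1: 0.9, 2: 0.8}  # named constants replace magic numbers
SHIPPING_FEE = 5.0

def price_clean(quantity, unit_price, tier):
    """Total price with a tier discount plus a flat shipping fee."""
    discount = DISCOUNTS.get(tier, 1.0)
    return quantity * unit_price * discount + SHIPPING_FEE

def behavior_preserved(original, refactored, cases):
    """Accept a refactoring only if it matches the original on all cases."""
    return all(original(*args) == refactored(*args) for args in cases)

if __name__ == "__main__":
    cases = [(3, 10.0, 1), (2, 4.5, 2), (1, 7.0, 0)]
    assert behavior_preserved(price_smelly, price_clean, cases)
    print("refactoring preserved behavior on all cases")
```

The paper performs this kind of acceptance check at repository scale, using functional tests over 50 CodeSearchNet-Python repositories rather than hand-written cases like these.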
Related papers
- Evaluating and Achieving Controllable Code Completion in Code LLM [89.64782747840225]
We present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench). We reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench.
arXiv Detail & Related papers (2026-01-22T11:40:04Z) - Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection [0.5249836059995157]
Code smells are symptoms of potential code quality problems that may affect software maintainability. This paper evaluates the effectiveness of four large language models (LLMs) for detecting nine code smells across 30 Java projects.
arXiv Detail & Related papers (2026-01-14T21:08:35Z) - Investigating The Smells of LLM Generated Code [2.9232837969697965]
Large Language Models (LLMs) are increasingly being used to generate program code. This study proposes a scenario-based method of evaluating the quality of LLM-generated code.
arXiv Detail & Related papers (2025-10-03T14:09:55Z) - How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [45.126233498200534]
We introduce CodeSmellEval, a benchmark designed to evaluate the propensity of Large Language Models for generating code smells. Our benchmark includes a novel metric, the Propensity Smelly Score (PSC), and a curated dataset of method-level code smells, CodeSmellData. To demonstrate the use of CodeSmellEval, we conducted a case study with two state-of-the-art LLMs, CodeLlama and Mistral. (A hedged sketch of such a propensity-style metric appears after this list.)
arXiv Detail & Related papers (2024-12-25T21:56:35Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
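The Propensity Smelly Score (PSC) mentioned above is not defined in the blurb; as a rough, hedged guess at how a propensity-style metric could work, the sketch below scores a set of generated methods as the fraction flagged by a trivial smell detector. The detector rules (too many parameters, long method) and their thresholds are assumptions for illustration only, not the paper's definition.

```python
# A hedged, illustrative take on a smell-propensity metric: the share of
# generated methods that trip at least one simple smell rule. The rule
# thresholds and the metric itself are assumptions, not the PSC from the paper.
import ast

MAX_PARAMS = 5       # "long parameter list" threshold (assumed)
MAX_BODY_LINES = 30  # "long method" threshold (assumed)

def is_smelly(source: str) -> bool:
    """Flag a method if any toy smell rule fires."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            if len(node.args.args) > MAX_PARAMS:
                return True
            if (node.end_lineno - node.lineno) > MAX_BODY_LINES:
                return True
    return False

def propensity_score(generated_methods: list[str]) -> float:
    """Fraction of generated methods flagged as smelly (0.0 = clean)."""
    if not generated_methods:
        return 0.0
    return sum(is_smelly(m) for m in generated_methods) / len(generated_methods)

if __name__ == "__main__":
    samples = [
        "def add(a, b):\n    return a + b",
        "def f(a, b, c, d, e, g, h):\n    return a",  # too many parameters
    ]
    print(propensity_score(samples))  # 0.5
```

A real benchmark would replace the toy rules with a proper smell detector and aggregate over many prompts per model; the point here is only the shape of the metric: smelly generations divided by total generations.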