Related papers: Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation

Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation

URL: http://arxiv.org/abs/2507.06980v1
Date: Wed, 09 Jul 2025 16:07:20 GMT
Title: Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation
Authors: Binquan Zhang, Li Zhang, Zhiwen Luo, Yuxin Du, Fang Liu, Song Wang, Lin Shi,
Abstract summary: Large language models (LLMs) have demonstrated impressive performance in code generation.<n>However, little is known about the quality of chain-of-thought (CoT) generated by LLMs.<n>This paper empirically explores the external and internal factors of why LLMs generate unsatisfactory CoTs.
Score: 11.090557370168439
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated impressive performance in code generation, particularly when augmented with chain-of-thought (CoT) prompting techniques. They break down requirements into intermediate reasoning steps, which act as design rationales to guide LLMs in writing code like human programmers. Thus, the quality of these steps is crucial for ensuring the correctness and reliability of the generated code. However, little is known about the quality of CoT generated by LLMs. To what extent can we trust the thoughts generated by LLMs? How good are they? This paper empirically explores the external and internal factors of why LLMs generate unsatisfactory CoTs by analyzing 1,023 failed code samples on two widely used code generation benchmarks. We also evaluate their impact on code generation performance by analyzing 210 CoT-code pairs and refining the unsatisfied CoTs by prompting LLMs. Our study reveals three key findings: (1) External factors (53.60%), such as unclear requirements and lack of context, mainly affect CoT quality, while internal factors (40.10%) stem from LLMs' misunderstanding prompts. (2) Even when CoTs are correct, 18.5% of the generated code contains errors due to instruction-following issues; conversely, 11.90% of correct code is paired with flawed CoTs. (3) Refining low-quality CoTs is feasible, i.e., LLMs improve when given detailed problem descriptions. These findings highlight key challenges in CoT-based code generation and suggest directions for improving LLM reasoning and reliability.

Related papers

Is LLM-Generated Code More Maintainable \& Reliable than Human-Written Code? [4.893345190925178]
This study compares the internal quality attributes of LLM-generated and human-written code.<n>Our analysis shows that LLM-generated code has fewer bugs and requires less effort to fix them overall.
arXiv Detail & Related papers (2025-08-01T15:17:34Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A.<n>We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
Uncertainty-Guided Chain-of-Thought for Code Generation with LLMs [45.33160999781074]
Chain-of-Thought (CoT) reasoning has been demonstrated as an effective technique for improving the problem-solving capabilities of large language models (LLMs)<n>We introduce UnCert-CoT, an approach designed to enhance code generation by incorporating an uncertainty-aware CoT reasoning mechanism.
arXiv Detail & Related papers (2025-03-19T15:40:45Z)
Instruct or Interact? Exploring and Eliciting LLMs' Capability in Code Snippet Adaptation Through Prompt Engineering [19.019004855931676]
Large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. Their performance on adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. We propose an interactive prompting approach to elicit LLMs' adaptation ability.
arXiv Detail & Related papers (2024-11-23T09:40:36Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent [2.8391355909797644]
Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation.<n>There is still a gap between LLMs being capable coders and being top-tier software engineers.
arXiv Detail & Related papers (2024-05-31T22:06:18Z)
Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code. We find that code prompting exhibits a high-performance boost for multiple LLMs. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
Chain-of-Thought in Neural Code Generation: From and For Lightweight Language Models [22.392809555644646]
Large Language Models (LLMs) have demonstrated remarkable potential in code generation. In this study, we investigate lightweight Language Models (lLMs) which are defined to have fewer than 10 billion parameters. Based on these findings, we design a novel approach COTTON which can leverage lLMs to automatically generate Chain of Thought (CoTs) The results show that the CoTs generated by COTTON outperform the baselines in terms of automated and human evaluation metrics.
arXiv Detail & Related papers (2023-12-09T12:20:50Z)
Fixing Large Language Models' Specification Misunderstanding for Better Code Generation [13.494822086550604]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs)<n>It first exploits test case analysis to obtain specification understanding and enables a self-improvement process.<n>muFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z)
Structured Chain-of-Thought Prompting for Code Generation [48.43888515848583]
Chain-of-Thought (CoT) prompting is the state-of-the-art prompting technique. We propose Structured CoTs (SCoTs) and present a novel prompting technique for code generation, named SCoT prompting.
arXiv Detail & Related papers (2023-05-11T06:43:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.