Chain of Draft for Software Engineering: Challenges in Applying Concise Reasoning to Code Tasks
- URL: http://arxiv.org/abs/2506.10987v1
- Date: Wed, 12 Mar 2025 07:44:18 GMT
- Title: Chain of Draft for Software Engineering: Challenges in Applying Concise Reasoning to Code Tasks
- Authors: Shaoyi Yang
- Abstract summary: This research extends the Chain of Draft (CoD) method to software engineering. All CoD variants used significantly fewer tokens than Chain of Thought (CoT), and they maintain over 90% of CoT's code quality across key metrics including correctness, compatibility, and maintainability.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have become vital tools for software development, but they often require verbose intermediate reasoning for complex code tasks, leading to high latency and costs. This research extends the Chain of Draft (CoD) method to software engineering, designing and evaluating multiple CoD variants tailored for code tasks. Through comprehensive experiments on all 300 samples from the SWE-bench benchmark, we found that all CoD variants used significantly fewer tokens than Chain of Thought (CoT), with Baseline CoD being most efficient at 55.4% of CoT's tokens. While this represents substantial efficiency gains - translating to approximately 45% reduction in processing time and API costs - it differs from the extreme 7.6% reported in the original CoD paper for mathematical reasoning. This difference stems from the inherent complexity and context-dependency of software tasks, which require more detailed reasoning to maintain solution quality. Our multi-dimensional quality assessment revealed that CoD variants maintain over 90% of CoT's code quality across key metrics including correctness, compatibility, and maintainability, making them practical alternatives for real-world development scenarios where efficiency matters. This research demonstrates how domain-specific characteristics influence prompting strategy effectiveness and provides a framework for balancing efficiency with solution quality in software engineering applications. Our findings offer practical guidance for optimizing LLM-based development workflows through appropriate prompting strategy selection based on project requirements.
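The abstract does not reproduce the paper's prompt templates, so the sketch below is a minimal, assumed illustration of the CoD idea on a code task: the CoD instruction caps each reasoning step at a few words, the CoT instruction allows free-form reasoning, and OpenAI's `tiktoken` tokenizer estimates the token ratio. The prompt wording, the five-word cap, and the toy traces are assumptions, not the authors' materials.

```python
# Minimal sketch contrasting Chain-of-Thought and Chain-of-Draft prompts
# for a code task. Prompt wording and traces are illustrative assumptions,
# not the paper's templates.
import tiktoken

COT_SYSTEM = ("Think step by step about the bug, explaining each step in "
              "full sentences, then output the fixed code.")
COD_SYSTEM = ("Think step by step, but keep each draft step to at most "
              "five words. Then output the fixed code.")

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

# Hypothetical reasoning traces for the same SWE-bench-style issue.
cot_trace = ("The failing test shows that parse_date returns None for ISO "
             "strings with timezone offsets, because the regex anchors at "
             "end-of-string before the offset group, so we must extend the "
             "pattern to accept an optional offset and re-run the tests.")
cod_trace = "Regex misses tz offset; extend pattern; re-run tests."

ratio = token_count(cod_trace) / token_count(cot_trace)
print(f"CoD uses {ratio:.0%} of CoT's reasoning tokens in this toy example")
```

In the paper's experiments the Baseline CoD variant measured 55.4% of CoT's tokens on SWE-bench; the toy ratio above only illustrates the mechanism.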
Related papers
- Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks.
arXiv Detail & Related papers (2026-02-06T03:30:19Z) - Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework [0.26508608365976566]
In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion. Self-critique succeeds on code-reviewable logic errors but fails completely on external service integration. RAG achieves the highest completion across all failure types with superior efficiency.
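The abstract pairs each strategy with the failure types it handles; a minimal dispatch sketch of that finding (the failure-type labels are hypothetical, not the paper's taxonomy) could look like:

```python
# Hypothetical mapping from observed failure type to enhancement strategy,
# loosely following the abstract's findings; not the paper's framework.
def choose_strategy(failure_type: str) -> str:
    if failure_type == "external_service_integration":
        # Self-critique "fails completely" here per the abstract, so retrieve.
        return "rag"
    if failure_type == "code_reviewable_logic_error":
        return "self_critique"
    # Progressive prompting achieved 96.9% average task completion overall.
    return "progressive_prompting"

print(choose_strategy("external_service_integration"))  # -> rag
```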
arXiv Detail & Related papers (2026-02-02T23:08:03Z) - Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering [4.812321790984494]
We conduct an analysis of token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC). We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model. Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption, averaging 59.4% of tokens.
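A per-stage token breakdown like the one reported can be computed in a few lines; the trace record format below is an assumption, not ChatDev's actual log schema.

```python
# Sketch: aggregate token usage per SDLC stage from execution traces.
# The trace records (stage name, token count) are assumed for illustration.
from collections import defaultdict

traces = [
    {"stage": "design", "tokens": 1_200},
    {"stage": "coding", "tokens": 3_400},
    {"stage": "code_review", "tokens": 8_100},
    {"stage": "testing", "tokens": 1_900},
]

per_stage = defaultdict(int)
for event in traces:
    per_stage[event["stage"]] += event["tokens"]

total = sum(per_stage.values())
for stage, used in sorted(per_stage.items(), key=lambda kv: -kv[1]):
    print(f"{stage:12s} {used:6d} tokens  ({used / total:.1%})")
```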
arXiv Detail & Related papers (2026-01-20T20:52:14Z) - From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code. We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs. We analyze the code capability of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z) - Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning [65.20602712957725]
Caco is a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
arXiv Detail & Related papers (2025-10-05T07:59:24Z) - Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation [7.69951622965475]
LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks. We propose multicod, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions.
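The abstract says the framework learns to select the most promising CoD candidate; a minimal sketch of the selection step, with a stub heuristic standing in for the learned RL scorer, might be:

```python
# Sketch: pick the best of several CoD-generated candidate solutions.
# The paper learns this selection with RL; a stub scorer stands in here.
from typing import Callable

def select_candidate(candidates: list[str],
                     score: Callable[[str], float]) -> str:
    """Return the candidate the scorer rates highest."""
    return max(candidates, key=score)

# Stub scorer: placeholder heuristic, not a learned policy.
def passes_tests(code: str) -> float:
    return float("def solve" in code)

drafts = ["# empty draft", "def solve(xs): return sorted(xs)"]
print(select_candidate(drafts, passes_tests))
```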
arXiv Detail & Related papers (2025-09-26T08:40:17Z) - Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems [0.0]
Large Language Models (LLMs) like ChatGPT, Copilot, Gemini, and DeepSeek are transforming software engineering by automating key tasks. This study benchmarks these four prominent LLMs on 150 LeetCode problems across easy, medium, and hard difficulties. We evaluate each model based on execution time, memory usage, and algorithmic complexity, revealing significant performance differences.
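Execution time and memory usage of a generated solution can be measured with the standard library; a minimal harness in that spirit (the study's actual measurement setup is not described in the abstract) is:

```python
# Sketch: measure execution time and peak memory of a candidate solution,
# in the spirit of the study's LeetCode evaluation; the harness is assumed.
import time
import tracemalloc

def benchmark(fn, *args):
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

def two_sum(nums, target):  # sample LLM-generated solution
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

res, secs, peak_bytes = benchmark(two_sum, [2, 7, 11, 15], 9)
print(res, f"{secs * 1e6:.0f} µs", f"{peak_bytes} B peak")
```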
arXiv Detail & Related papers (2025-08-05T21:50:52Z) - CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation [19.071855537400463]
Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. CoCo-Bench is designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review.
arXiv Detail & Related papers (2025-04-29T11:57:23Z) - Optimizing Token Consumption in LLMs: A Nano Surge Approach for Code Reasoning Efficiency [5.044393644778693]
Chain of Thought (CoT) reasoning has become an essential approach for automated code repair. However, CoT leads to substantial increases in token consumption, reducing inference efficiency and raising computational costs. This paper proposes three targeted optimization strategies: Context Awareness, Responsibility Tuning, and Cost Sensitive.
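The abstract names the three strategies without detail. One plausible reading of "Context Awareness" is pruning the prompt to the files a task actually touches; the sketch below is an illustrative guess, not the paper's method.

```python
# Illustrative guess at a "Context Awareness" strategy: keep only the
# source files referenced by the failing test before prompting the model.
# The paper's actual strategies are not detailed in the abstract.
def prune_context(files: dict[str, str], failing_test: str) -> dict[str, str]:
    """Keep files whose module name appears in the failing test's source."""
    return {path: src for path, src in files.items()
            if path.rsplit("/", 1)[-1].removesuffix(".py") in failing_test}

repo = {"src/parser.py": "...", "src/render.py": "..."}
test = "from parser import parse_date\ndef test_tz(): ..."
print(list(prune_context(repo, test)))  # -> ['src/parser.py']
```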
arXiv Detail & Related papers (2025-04-22T15:51:00Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
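External TTC is commonly realized as best-of-n sampling with a verifier; a minimal sketch under that assumption (the sampler and reward model are stubs, and the paper's internal TTC is not shown) is:

```python
# Sketch of external test-time compute scaling: sample several candidate
# patches and keep the one a verifier scores highest. Sampler and verifier
# below are stubs standing in for an LLM and a reward model.
import random

def sample_patch(issue: str, seed: int) -> str:
    random.seed(seed)                         # stand-in for an LLM call
    return f"patch-{random.randint(0, 99)} for {issue}"

def verifier_score(patch: str) -> float:
    return float(sum(map(ord, patch)) % 100)  # stub reward model

def best_of_n(issue: str, n: int = 8) -> str:
    candidates = [sample_patch(issue, s) for s in range(n)]
    return max(candidates, key=verifier_score)

print(best_of_n("swe-bench issue #1234"))
```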
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language Models. Controlled Decoding provides a mechanism for aligning a model at inference time without retraining. We propose a mixture of agent-based decoding strategies leveraging existing off-the-shelf aligned LLM policies.
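The paper switches among off-the-shelf aligned policies during decoding; the sketch below conveys only the coarse idea at the level of whole continuations, with stub policies and a stub reward, whereas the actual method operates token by token.

```python
# Sketch: inference-time alignment by choosing among continuations from
# several off-the-shelf policies. Chunk-level only; the paper's method
# is finer-grained and uses a learned switching rule.
from typing import Callable

Policy = Callable[[str], str]

def collab_decode(prompt: str, policies: list[Policy],
                  reward: Callable[[str], float], steps: int = 3) -> str:
    text = prompt
    for _ in range(steps):
        continuations = [p(text) for p in policies]
        text += max(continuations, key=reward)  # keep the best-rewarded one
    return text

polite: Policy = lambda t: " please"   # stub policies, ignore context
terse: Policy = lambda t: " now"
print(collab_decode("Do it", [polite, terse], reward=len))
```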
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
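DARS branches by re-sampling actions at unpromising decision points; a much-simplified sketch of that loop, with stub propose/execute functions and no tree bookkeeping, might be:

```python
# Sketch: re-sample an agent action when execution feedback is negative,
# echoing DARS's idea of branching at unpromising decision points. The
# agent, environment, and budget handling are simplified stand-ins.
def dars_step(state: str, propose, execute, k: int = 3) -> str:
    """Try up to k sampled actions; return the first state with good feedback."""
    for attempt in range(k):
        action = propose(state, attempt)
        new_state, ok = execute(state, action)  # each branch starts from state
        if ok:
            return new_state
    return new_state  # keep the last branch if none succeeded

propose = lambda s, i: f"edit-{i}"
execute = lambda s, a: (s + "|" + a, a == "edit-2")
print(dars_step("init", propose, execute))  # -> init|edit-2
```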
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - LATTE: Learning to Think with Vision Specialists [103.5952731807559]
We propose LATTE, a family of vision-language models that offload perception to state-of-the-art vision models. This lets the vision-language model focus solely on reasoning over high-quality perceptual information.
arXiv Detail & Related papers (2024-12-07T00:42:04Z) - CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
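The self-generation-and-validation mechanism can be sketched directly from the abstract: candidate programs are scored by how many generated tests they pass. The iterative, mutually reinforcing ranking CodeDPO describes is omitted here, and `exec` on untrusted model output would of course need sandboxing.

```python
# Sketch of self-generation-and-validation: candidate solutions are ranked
# by how many generated tests they pass. Simplest version only; CodeDPO's
# iterative weighting of codes and tests is omitted.
def run_test(code: str, test: str) -> bool:
    env: dict = {}
    try:
        exec(code, env)  # trusted toy strings only; sandbox real LLM output
        exec(test, env)
        return True
    except Exception:
        return False

codes = ["def add(a, b): return a + b", "def add(a, b): return a - b"]
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

scores = {c: sum(run_test(c, t) for t in tests) for c in codes}
best = max(codes, key=scores.get)
print(best, scores[best])  # the correct add passes both tests
```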
arXiv Detail & Related papers (2024-10-08T01:36:15Z) - BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains across 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of up to 60%, significantly lower than human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z) - MapCoder: Multi-Agent Code Generation for Competitive Problem Solving [3.3856216159724983]
We introduce a new approach to code generation tasks leveraging multi-agent prompting.
Our framework, MapCoder, consists of four LLM agents specifically designed to emulate the stages of program synthesis.
Our method consistently delivers superior performance across various programming languages.
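A staged four-agent pipeline in MapCoder's spirit can be sketched with stubs; the agent names below are inferred from the abstract's "stages of program synthesis", and each stub stands in for an LLM call.

```python
# Sketch of a MapCoder-style pipeline: four staged agents. Each agent is a
# stub here; in the paper each stage is realized by an LLM prompt.
def retrieval_agent(problem):           return f"similar examples for: {problem}"
def planning_agent(problem, examples):  return f"plan using {examples}"
def coding_agent(plan):                 return f"code implementing ({plan})"
def debugging_agent(code):              return code + "  # fixed after tests"

def mapcoder_pipeline(problem: str) -> str:
    examples = retrieval_agent(problem)
    plan = planning_agent(problem, examples)
    code = coding_agent(plan)
    return debugging_agent(code)

print(mapcoder_pipeline("two-sum"))
```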
arXiv Detail & Related papers (2024-05-18T22:10:15Z) - NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts [31.783388267874738]
We propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks.
NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services.
Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant.
arXiv Detail & Related papers (2024-05-07T17:52:51Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology [4.2990995991059275]
Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) have transformed the field of Software Engineering.
We introduce CodePori, a novel system designed to automate code generation for large and complex software projects.
Results: CodePori is able to generate running code for large-scale projects, aligned with the typical software development process.
arXiv Detail & Related papers (2024-02-02T13:42:50Z) - Efficient Controllable Multi-Task Architectures [85.76598445904374]
We propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable.
Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost.
This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures.
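The core trick behind slimmable encoders and decoders is a layer whose active width is chosen at call time; a minimal PyTorch sketch (training recipe, normalization handling, and the budget controller from the paper are omitted) is:

```python
# Sketch: a linear layer whose active output width is set at call time,
# the basic mechanism behind slimmable architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    def __init__(self, in_features: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, width: int) -> torch.Tensor:
        # Use only the first `width` output channels of the full layer.
        return F.linear(x, self.weight[:width], self.bias[:width])

layer = SlimmableLinear(16, 64)
x = torch.randn(2, 16)
print(layer(x, width=32).shape)  # torch.Size([2, 32])
```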
arXiv Detail & Related papers (2023-08-22T19:09:56Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 pairs of competitive C++ programming submissions, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
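Retrieval-based few-shot prompting for this task amounts to finding (slow, fast) edit pairs similar to the query program and prepending them as demonstrations; the sketch below uses a crude token-overlap similarity as a stand-in for the paper's retriever.

```python
# Sketch: retrieval-based few-shot prompting for code optimization.
# Token-overlap similarity is a stub for a real retriever.
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_prompt(query: str, edit_pairs: list[tuple[str, str]],
                 k: int = 2) -> str:
    shots = sorted(edit_pairs, key=lambda p: similarity(query, p[0]),
                   reverse=True)[:k]
    demos = "\n\n".join(f"# slow\n{s}\n# fast\n{f}" for s, f in shots)
    return f"{demos}\n\n# slow\n{query}\n# fast\n"

pairs = [("for i in range(len(xs)): total += xs[i]", "total = sum(xs)")]
print(build_prompt("for i in range(len(ys)): s += ys[i]", pairs, k=1))
```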
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.