Related papers: Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation

Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation

URL: http://arxiv.org/abs/2509.25243v1
Date: Fri, 26 Sep 2025 08:40:17 GMT
Title: Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation
Authors: Xunzhu Tang, Iyiola Emmanuel Olatunji, Tiezhu Sun, Jacques Klein, Tegawende F. Bissyande,
Abstract summary: LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks.<n>We propose multicod, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions.
Score: 7.69951622965475
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks requiring correctness and semantic alignment. While Chain-of-Thought (CoT) prompting enhances reasoning through intermediate steps, it suffers from verbosity and inefficiency. Chain-of-Draft (CoD) prompting offers more concise reasoning, but the stochastic nature of LLMs produces varying solution quality, making optimal selection challenging. We propose \multicod, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions. Our approach uses strategy-guided prompting to encourage diverse reasoning styles and models solution selection as a contextual bandit problem. The framework optimizes interpretable features including code complexity, reasoning structure, and strategic metadata through a reward function balancing correctness, efficiency, and clarity. Experiments on MBPP, BigCodeBench, SWE-bench Verified, and Defects4J show \multicod~outperforms and in some cases, on par with standard prompting, CoT, and CoD baselines while achieving cost and token efficiency from the user's perspective through a multi-candidate design that charges only for the selected output, reducing user billing by over 50\% and improving LLM response quality, making \multicod~more sustainable and scalable for real-world deployment. Our code is available: https://anonymous.4open.science/r/MultiCoD.

Related papers

$\ abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization [28.52469449694436]
Large language models (LLMs) have shown strong performance in math and logic reasoning.<n>But their ability to handle systematic optimization (CO) remains underexplored.<n>We introduce NLCO, a benchmark that evaluates LLMs on end-to-end CO reasoning.
arXiv Detail & Related papers (2026-02-02T14:55:48Z)
Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs)<n>We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.<n>We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
Optimizing Prompt Sequences using Monte Carlo Tree Search for LLM-Based Optimization [20.44067161623662]
Large language models (LLMs) have demonstrated remarkable capabilities in code generation and structured reasoning.<n>We propose a novel neural-symbolic framework that formulates prompt selection as a sequential decision process guided by Monte Carlo Tree Search.<n>Our method explores and refines multi-step prompt sequences for the goal of improving code generation quality.
arXiv Detail & Related papers (2025-08-08T04:01:24Z)
Reinforced Latent Reasoning for LLM-based Recommendation [83.18146814163308]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks.<n>Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data.<n>In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning [30.265984245328124]
Chain-of-Thought prompting indiscriminately generates lengthy reasoning steps for all queries.<n>AdaCoT (Adaptive Chain-of-Thought) is a novel framework enabling LLMs to adaptively decide when to invoke CoT.<n>A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse.
arXiv Detail & Related papers (2025-05-17T08:27:00Z)
Uncertainty-Guided Chain-of-Thought for Code Generation with LLMs [45.33160999781074]
Chain-of-Thought (CoT) reasoning has been demonstrated as an effective technique for improving the problem-solving capabilities of large language models (LLMs)<n>We introduce UnCert-CoT, an approach designed to enhance code generation by incorporating an uncertainty-aware CoT reasoning mechanism.
arXiv Detail & Related papers (2025-03-19T15:40:45Z)
Robust Multi-Objective Controlled Decoding of Large Language Models [14.58153072993207]
We introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimize for improving worst-case rewards.<n>RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy.<n>We show that the game reduces to a convex optimization problem to find the worst-case weights, while the best response policy can be computed analytically.
arXiv Detail & Related papers (2025-03-11T18:15:26Z)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks.<n>We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations. We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
arXiv Detail & Related papers (2023-04-23T13:54:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.