SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
- URL: http://arxiv.org/abs/2511.02854v1
- Date: Fri, 31 Oct 2025 06:30:47 GMT
- Title: SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
- Authors: Yixiang Chen, Tianshi Zheng, Shijue Huang, Zhitao He, Yi R. Fung,
- Abstract summary: Test-time scaling without interpreter feedback is essential for real-world code generation scenarios.<n>We introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed.
- Score: 12.241337921137259
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Test-time scaling without interpreter feedback is essential for real-world code generation scenarios where test cases are not readily available. While existing paradigms often rely on either greedy exploitation (i.e., iterative refinement) or stochastic exploration (i.e., relying on sample-based voting or reranking mechanisms), the balance between these two dimensions remains underexplored. To investigate the LLM's intrinsic ability to balance exploitation and exploration, we introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed. Our results show that SELF-REDRAFT consistently achieves better performance than Self-Refine when converged under the same maximum number of iterations. Still, we observe that significant room for improvement remains, largely due to two core aspects of current self-redraft capabilities: constrained capacity for generating instructive feedback and fragile discriminative judgment. We also find that balancing strategies vary notably across different LLMs, reflecting distinct, model-specific behaviors. Overall, our study establishes a baseline for intrinsic exploration-exploitation balancing in test-time scaling and identifies feedback and discrimination as key areas with potential for future advances.
Related papers
- Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior.<n>This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias.<n> Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z) - Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model.<n>BNRM represents rewards through a sparse, non-negative latent factor generative process.<n>We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z) - A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms [20.241519889633285]
Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms play a critical role.<n>We conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS.<n>We introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities.
arXiv Detail & Related papers (2026-01-19T17:23:45Z) - Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading.<n>We find that alignment is strong for binary tasks but degrades with increased rubric granularity.<n>Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z) - Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.2144357080404]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs)<n>We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z) - Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking [78.69179041551014]
We propose an information-theoretic reward modeling framework based on the Information Bottleneck principle.<n>We show that InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization.<n>We also introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape.
arXiv Detail & Related papers (2025-10-15T15:51:59Z) - Towards Inference-time Scaling for Continuous Space Reasoning [55.40260529506702]
Inference-time scaling has proven effective for text-based reasoning in large language models.<n>This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space.<n>We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling.
arXiv Detail & Related papers (2025-10-14T05:53:41Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment.<n>Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners [18.960920426485163]
Self-improvement has emerged as a primary method for enhancing performance.<n>We identify and propose methods to monitor two pivotal factors in this iterative process.<n>We introduce B-STaR, a Self-Taught Reasoning framework that adjusts configurations across iterations to balance exploration and exploitation.
arXiv Detail & Related papers (2024-12-23T03:58:34Z) - LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation [28.61326111959728]
Large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content.
Our study addresses this knowledge gap by simulating two critical phases of the retrieval-augmented generation (RAG) framework.
Contrary to previous findings, our results reveal no significant self-preference effect in RAG frameworks.
arXiv Detail & Related papers (2024-10-28T08:32:09Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in
Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.