Related papers: A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models

A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models

URL: http://arxiv.org/abs/2508.12903v2
Date: Sun, 05 Oct 2025 15:00:27 GMT
Title: A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
Authors: Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao,
Abstract summary: ProActive Self-Refinement (PASR) is a novel method for improving large language models (LLMs)<n>Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context.<n>We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR.
Score: 53.31664844941449
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6% compared to standard generation, while also achieving an 8.2% improvement in accuracy. Our code and baselines used in the paper are available on GitHub.

Related papers

Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model [98.35868970993232]
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm.<n>We introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber) to achieve better inference speed and output quality in code generation.
arXiv Detail & Related papers (2025-10-20T23:38:12Z)
Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One of the approaches for improving math reasoning is self-correction.<n>Existing self-correction approaches treat corrections as standalone post-generation refinements.<n>We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
CorrectBench: Automatic Testbench Generation with Functional Self-Correction using LLMs for HDL Design [6.414167153186868]
We propose CorrectBench, an automatic testbench generation framework with functional self-validation and self-correction. The proposed approach can validate the correctness of the generated testbenches with a success rate of 88.85%. Our work's performance is 62.18% higher than previous work in sequential tasks and almost 5 times the pass ratio of the direct method.
arXiv Detail & Related papers (2024-11-13T10:45:19Z)
An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leveraging the logits of only the first generated token. We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains. Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z)
CERET: Cost-Effective Extrinsic Refinement for Text Generation [14.43795791836198]
We propose CERET, a method for refining text generations by considering semantic stability, entailment and inter-sample uncertainty measures. Experimental results show that CERET outperforms Self-consistency and Self-rerank baselines consistently under various task setups.
arXiv Detail & Related papers (2024-06-08T22:17:52Z)
Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself.<n>We present Reflection-Reinforced Self-Training (Re-ReST), which uses a textitreflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLM) are leveraging human feedback to improve their generation quality. We propose LLMRefine, an inference time optimization method to refine LLM's output. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization. LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
Lightweight reranking for language model generations [26.942659041383596]
We present a novel approach for reranking Large Language Models (LLMs) generations. Unlike other techniques that might involve additional inferences or training a specialized reranker, our approach relies on easy to compute pairwise statistics. We show strong improvements for selecting the best k generations for code generation tasks as well as robust improvements for the best generation for the tasks of autoformalization, summarization, and translation.
arXiv Detail & Related papers (2023-07-11T17:51:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.