Related papers: Reflect before Act: Proactive Error Correction in Language Models

Reflect before Act: Proactive Error Correction in Language Models

URL: http://arxiv.org/abs/2509.18607v1
Date: Tue, 23 Sep 2025 03:53:45 GMT
Title: Reflect before Act: Proactive Error Correction in Language Models
Authors: Qiuhai Zeng, Sarvesh Rajkumar, Di Wang, Narendra Gyanchandani, Wenbo Yan,
Abstract summary: "Reflect before Act" (REBACT) is a novel approach that enhances LLM-based decision-making by introducing a critical reflect step prior to taking the next action.<n>We evaluate REBACT on three diverse interactive environments: ALFWorld, WebShop, and TextCraft.
Score: 2.8286165811226094
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in interactive decision-making tasks, but existing methods often struggle with error accumulation and lack robust self-correction mechanisms. We introduce "Reflect before Act" (REBACT), a novel approach that enhances LLM-based decision-making by introducing a critical reflect step prior to taking the next action. This approach allows for immediate error correction, ensuring smooth action path and adaptibity to environment feedback. We evaluate REBACT on three diverse interactive environments: ALFWorld, WebShop, and TextCraft. Our results demonstrate that REBACT significantly outperforms strong baselines, improving success rates by up to 24% on WebShop (achieving 61%), 6.72% on ALFWorld (achieving 98.51%), and 0.5% on TextCraft (achieving 99.5%) using Claude3.5-sonnet as the underlying LLM. Further analysis reveals that REBACT's performance improvements are achieved with only a few modification steps, demonstrating its computational efficiency.

Related papers

DEPO: Dual-Efficiency Preference Optimization for LLM Agents [75.6723341304463]
We propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps.<n>Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance.
arXiv Detail & Related papers (2025-11-19T12:38:43Z)
Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures.<n>We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy.<n>Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$times$ compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
Understanding Robustness of Model Editing in Code LLMs: An Empirical Study [1.5624785508022727]
We present a systematic study of five state-of-the-art model editing methods.<n>We apply these methods to three leading open-source code LLMs, CodeLlama, CodeQwen1.5, and DeepSeek-Coder.<n>Instant edits consistently degrade model performance, with syntactic validity dropping by up to 86 percentage points and functional correctness declining by 45 points even in the best-performing setting.
arXiv Detail & Related papers (2025-11-05T04:58:13Z)
BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions [19.682278660857584]
We introduce BloomAPR, a novel dynamic evaluation framework grounded in Bloom's taxonomy.<n>Our framework offers a structured approach to assess the cognitive capabilities of LLM-powered APR solutions across progressively complex reasoning levels.<n>Our findings show that while these solutions exhibit basic reasoning skills, their performance increases with synthetically generated bugs.
arXiv Detail & Related papers (2025-09-29T20:16:11Z)
Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One of the approaches for improving math reasoning is self-correction.<n>Existing self-correction approaches treat corrections as standalone post-generation refinements.<n>We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z)
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection [60.75785864719726]
We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning.<n>We construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks.
arXiv Detail & Related papers (2025-05-22T10:03:05Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.<n>Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.<n>LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.<n>We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs [29.735465300269993]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often struggle with spatial reasoning.<n>This paper presents a novel neural-symbolic framework that enhances LLMs' spatial reasoning abilities through iterative feedback between LLMs and Answer Set Programming (ASP)<n>We evaluate our approach on two benchmark datasets: StepGame and SparQA.
arXiv Detail & Related papers (2024-11-27T18:04:05Z)
Instruct or Interact? Exploring and Eliciting LLMs' Capability in Code Snippet Adaptation Through Prompt Engineering [19.019004855931676]
Large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. Their performance on adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. We propose an interactive prompting approach to elicit LLMs' adaptation ability.
arXiv Detail & Related papers (2024-11-23T09:40:36Z)
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [61.88942482411035]
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs) ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
arXiv Detail & Related papers (2023-06-06T01:26:53Z)
DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [7.388002745070808]
We study how breaking down the generation problem into sub-problems and feeding the solutions of those sub-problems into Large Language Models can be effective. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%.
arXiv Detail & Related papers (2023-04-21T15:02:18Z)
Re-Evaluating LiDAR Scene Flow for Autonomous Driving [80.37947791534985]
Popular benchmarks for self-supervised LiDAR scene flow have unrealistic rates of dynamic motion, unrealistic correspondences, and unrealistic sampling patterns. We evaluate a suite of top methods on a suite of real-world datasets. We show that despite the emphasis placed on learning, most performance gains are caused by pre- and post-processing steps.
arXiv Detail & Related papers (2023-04-04T22:45:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.