Related papers: Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

URL: http://arxiv.org/abs/2509.18847v2
Date: Thu, 25 Sep 2025 14:17:18 GMT
Title: Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
Authors: Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han, Junfeng Luo, Yurui Qiu,
Abstract summary: Current self-reflection practices rely on prompts or one-way reasoning. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls.
Score: 10.598440138966028
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to 'think more' instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.
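The four programmatic checks named in the abstract (structural validity, executability, parameter correctness, result consistency) can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the JSON call format, the `TOOLS` registry, the `get_weather` tool, and the `check_call` function are all hypothetical names assumed here for illustration only.

```python
import json
from typing import Any, Callable

# Hypothetical tool registry (an assumption, not from the paper):
# tool name -> (required parameter types, implementation).
TOOLS: dict[str, tuple[dict[str, type], Callable[..., Any]]] = {
    "get_weather": ({"city": str}, lambda city: {"city": city, "temp_c": 21}),
}


def check_call(raw: str, gold_args: dict[str, Any], gold_result: Any) -> dict[str, bool]:
    """Run four checks on one tool call given as a JSON string."""
    report = {"structure": False, "executable": False,
              "parameters": False, "result": False}

    # 1. Structural validity: parseable JSON with a name and an arguments object.
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return report
    if not (isinstance(call, dict)
            and isinstance(call.get("name"), str)
            and isinstance(call.get("arguments"), dict)):
        return report
    report["structure"] = True

    # 2. Executability: known tool, matching schema, and no exception when run.
    spec = TOOLS.get(call["name"])
    if spec is None:
        return report
    schema, impl = spec
    args = call["arguments"]
    if set(args) != set(schema) or not all(
            isinstance(args[k], t) for k, t in schema.items()):
        return report
    try:
        result = impl(**args)
    except Exception:
        return report
    report["executable"] = True

    # 3. Parameter correctness and 4. result consistency against the reference.
    report["parameters"] = args == gold_args
    report["result"] = result == gold_result
    return report


# A mini trajectory: an erroneous call, then the corrected call after reflection.
gold_args = {"city": "Paris"}
gold_result = {"city": "Paris", "temp_c": 21}
erroneous = '{"name": "get_weather", "arguments": {"location": "Paris"}}'
corrected = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

before = check_call(erroneous, gold_args, gold_result)
after = check_call(corrected, gold_args, gold_result)
```

In this sketch the erroneous call passes only the structure check because its parameter name does not match the assumed schema, while the corrected call passes all four. A reward scheme of the kind the abstract describes could plausibly be built on per-check signals like these, though the paper's actual reward design is not given in this listing.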

Related papers

OCR-Agent: Agentic OCR with Capability and Memory Reflection [5.8505408398110434]
Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks. We propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets.
arXiv Detail & Related papers (2026-02-24T16:10:27Z)
ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ET-Agent is a training framework for calibrating an agent's tool-use behavior. It is designed to progressively calibrate erroneous behavioral patterns to optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z)
ToolCritic: Detecting and Correcting Tool-Use Errors in Dialogue Systems [4.930296454541593]
ToolCritic is a framework that evaluates and improves tool usage in multi-turn, tool-augmented dialogues. Trials show ToolCritic improves tool-calling accuracy by up to 13% over baselines.
arXiv Detail & Related papers (2025-10-19T23:42:39Z)
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
arXiv Detail & Related papers (2025-09-27T10:37:11Z)
SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection [14.40651157974557]
SAMULE is a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis. It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error taxonomies across multiple trials of the same task; and Inter-Task Learning (macro-level) to extract transferable insights based on errors of the same type from diverse task failures.
arXiv Detail & Related papers (2025-09-24T21:02:15Z)
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates [56.73907811047611]
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities. LLMs often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. We introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls.
arXiv Detail & Related papers (2025-09-22T17:55:14Z)
Reinforcement Learning for Machine Learning Engineering Agents [52.03168614623642]
We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger but static models. We propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. We also propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early.
arXiv Detail & Related papers (2025-09-01T18:04:10Z)
Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning [63.2198957755528]
We propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback.
arXiv Detail & Related papers (2025-06-05T04:35:49Z)
MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning [33.009759731505746]
Complex tasks involving tool integration pose significant challenges for Large Language Models. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic benchmarks. We propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory.
arXiv Detail & Related papers (2025-05-27T03:37:33Z)
Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls. Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training [18.896813839389893]
We propose an iterative self-training framework, Agent-R, that enables language agents to reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction.
arXiv Detail & Related papers (2025-01-20T11:46:04Z)
LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
The "LLMs-as-Instructors" framework autonomously enhances the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model. Within this framework, we implement two strategies: "Learning from Error", which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
Efficient Tool Use with Chain-of-Abstraction Reasoning [63.08202389132155]
Large language models (LLMs) need to ground their reasoning to real-world knowledge. There remain challenges for fine-tuning LLM agents to invoke tools in multi-step reasoning problems. We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.