Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
- URL: http://arxiv.org/abs/2601.15625v1
- Date: Thu, 22 Jan 2026 03:57:35 GMT
- Title: Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
- Authors: Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong,
- Abstract summary: We propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator. On the BFCL v4 Multi-Turn benchmark, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and, crucially, yields a 4% overall accuracy gain.
- Score: 41.78467154106763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool-call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where execution errors are inevitable during tool interaction. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This lets the model learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn benchmark, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and, crucially, yields a 4% overall accuracy gain (42.75% to 46.75%) over GRPO, outperforming specialized tool-use agents.
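The fission step described in the abstract can be sketched in a few lines. This is a hedged illustration only: the names (`Trajectory`, `fission`, the diagnosis callback standing in for the finetuned Error Simulator) are assumptions for exposition, not the paper's actual API.

```python
# Minimal sketch of the fission mechanism: a failed rollout is turned into a
# new training instance by appending diagnostic feedback, after which recovery
# rollouts would be resampled on-policy. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    prompt: str
    turns: list   # alternating model tool calls and environment outputs
    failed: bool  # True if the last tool call raised an execution error

def fission(traj: Trajectory,
            simulate_diagnosis: Callable[[Trajectory], str]) -> Trajectory:
    """Fission a failed trajectory into a fresh training instance by
    appending diagnostic feedback (the Error Simulator's role in the paper)."""
    diagnosis = simulate_diagnosis(traj)
    return Trajectory(
        prompt=traj.prompt,
        turns=traj.turns + [{"role": "tool", "content": diagnosis}],
        failed=False,  # the new instance restarts exploration from the error state
    )

# Toy usage: a rollout that failed on a malformed tool-call argument.
failed = Trajectory(
    prompt="Book a flight",
    turns=[{"role": "assistant", "content": "search_flights(dat='2026-01-22')"}],
    failed=True,
)
instance = fission(
    failed,
    lambda t: "Error: unknown argument 'dat'; did you mean 'date'?",
)
print(len(instance.turns))  # 2: the original call plus the diagnostic feedback
```

The key design point is that the diagnostic turn is generated from the model's own on-policy failure, which is what distinguishes this from training on static, pre-collected error-correction data.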
Related papers
- Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
arXiv Detail & Related papers (2026-02-10T21:08:53Z) - CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning [4.765206163164323]
CLEANER exploits intrinsic self-correction capabilities to eliminate error-contaminated context during data collection. A Similarity-Aware Adaptive Rollback mechanism autonomously constructs clean, purified trajectories. Results show average accuracy gains of 6%, 3%, and 5% over baselines.
arXiv Detail & Related papers (2026-01-21T16:14:30Z) - CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z) - Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z) - PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases [2.3181214107210235]
PALADIN trains on 50,000+ recovery-annotated trajectories constructed via systematic failure injection. Training uses LoRA-based fine-tuning to retain base capabilities while injecting recovery competence. This approach generalizes to novel failures beyond the training distribution.
arXiv Detail & Related papers (2025-09-25T10:37:30Z) - Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models [0.7910367295422812]
Large language models (LLMs) make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources.
arXiv Detail & Related papers (2025-07-03T16:41:30Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.