ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
- URL: http://arxiv.org/abs/2603.01620v3
- Date: Thu, 05 Mar 2026 10:21:10 GMT
- Title: ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
- Authors: Pengbo Liu,
- Abstract summary: We present ToolRLA, a post-training pipeline for domain-specific tool agents. The core contribution is a fine-grained reward function with multiplicative correctness decomposition. Over three months of deployment, ToolRLA achieves a 47% improvement in task completion rate.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a three-stage post-training pipeline (SFT -> GRPO -> DPO) for domain-specific tool agents. The core contribution is a fine-grained reward function with multiplicative correctness decomposition spanning four dimensions -- format validity, tool selection, parameter accuracy, and regulatory compliance -- that encodes domain priority orderings as inductive biases in the reward landscape. Deployed on a financial advisory copilot (80+ advisors, 1,200+ daily queries), ToolRLA achieves, over three months, a 47% improvement in task completion rate (62%->91%), a 63% reduction in tool invocation errors (38%->14%), and a 93% reduction in regulatory violations (12%->0.8%), all within sub-2-second latency. Ablation studies show the multiplicative reward design accounts for 7 percentage points of improvement over additive alternatives. Generalization is further validated on ToolBench and API-Bank.
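To make the multiplicative decomposition concrete, here is a minimal sketch of such a reward, assuming the four dimensions named in the abstract; the component scorers, gating choices, and value ranges are illustrative guesses, not the paper's actual implementation.

```python
def toolrla_style_reward(format_valid: bool, tool_correct: bool,
                         param_score: float, compliance_score: float) -> float:
    """Multiplicative correctness decomposition (sketch).

    Hard gates (format validity, tool choice) multiply graded terms in
    [0, 1], so any hard failure zeroes the reward and cannot be offset
    by partial credit elsewhere -- encoding a priority ordering.
    """
    gate = float(format_valid) * float(tool_correct)
    return gate * param_score * compliance_score
```

By contrast, an additive form such as `0.25 * (format + tool + params + compliance)` would let high parameter accuracy partially mask a wrong tool choice, which is exactly the failure mode a multiplicative design rules out.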
Related papers
- Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents [54.18201810286764]
Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. In long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance. We propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior.
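A minimal sketch of the entropy-reduction idea, assuming the signal is the drop in the agent's answer-distribution entropy across a tool call; the paper's two actual reward strategies are not detailed in this summary.

```python
import math

def entropy(probs):
    # Shannon entropy of a categorical distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_reduction_reward(probs_before, probs_after):
    # Reward a tool call by how much it reduced uncertainty; excessive,
    # low-quality calls that leave the distribution unchanged earn ~0.
    return entropy(probs_before) - entropy(probs_after)
```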
arXiv Detail & Related papers (2026-02-02T12:52:14Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$\times$ compared to strong prompting and uncertainty-based baselines.
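A small sketch of EVPI-driven question selection under the stated POMDP framing; the `posterior`/`utilities` interface and the cost threshold are illustrative assumptions, not the paper's API.

```python
def evpi(posterior, utilities):
    # EVPI = E_theta[max_a U(a, theta)] - max_a E_theta[U(a, theta)]
    # posterior: {theta: prob}; utilities: {(action, theta): utility}
    actions = {a for a, _ in utilities}
    value_with_perfect_info = sum(
        p * max(utilities[(a, th)] for a in actions)
        for th, p in posterior.items())
    value_act_now = max(
        sum(p * utilities[(a, th)] for th, p in posterior.items())
        for a in actions)
    return value_with_perfect_info - value_act_now

def should_clarify(posterior, utilities, question_cost):
    # Ask only when the information is worth more than the question costs.
    return evpi(posterior, utilities) > question_cost
```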
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
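Rule-based scoring and pair construction might look like the following sketch; the scoring rules and data schema are hypothetical stand-ins for the paper's pipeline.

```python
def rule_score(call, spec):
    # Illustrative rules: correct tool, correct argument names, exact values.
    score = float(call["tool"] == spec["tool"])
    score += float(call["args"].keys() == spec["args"].keys())
    exact = sum(call["args"].get(k) == v for k, v in spec["args"].items())
    return score + exact / max(len(spec["args"]), 1)

def make_preference_pairs(candidates, spec):
    # Rank sampled candidate calls and pair the best against strictly
    # worse ones as (chosen, rejected) preference data.
    scored = sorted(((rule_score(c, spec), c) for c in candidates),
                    key=lambda t: t[0], reverse=True)
    best_score, best = scored[0]
    return [(best, c) for s, c in scored[1:] if s < best_score]
```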
arXiv Detail & Related papers (2025-10-30T06:08:27Z)
- AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI [5.165179548592513]
AgentChangeBench is a benchmark designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation.
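A compact sketch of how the four metrics could be computed from logged episodes; the field names are illustrative, not the benchmark's actual schema.

```python
def agentchange_metrics(episodes):
    # episodes: list of per-episode log dicts (assumed non-empty).
    calls = [c for e in episodes for c in e["tool_calls"]]
    shifted = [e for e in episodes if e["goal_shifted"]]
    return {
        "TSR": sum(e["success"] for e in episodes) / len(episodes),
        "TUE": sum(c["ok"] for c in calls) / len(calls),
        "TCRR": sum(c["redundant"] for c in calls) / len(calls),
        # GSRT: mean turns to recover after a mid-dialogue goal shift.
        "GSRT": sum(e["recovery_turns"] for e in shifted) / max(len(shifted), 1),
    }
```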
arXiv Detail & Related papers (2025-10-20T23:48:07Z)
- PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases [2.3181214107210235]
PALADIN trains on 50,000+ recovery-annotated trajectories constructed via systematic failure injection. Training uses LoRA-based fine-tuning to retain base capabilities while injecting recovery competence. This approach generalizes to novel failures beyond the training distribution.
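Systematic failure injection could be sketched as below; the failure taxonomy and the `recover` callable are hypothetical, meant only to show the trajectory-construction shape.

```python
import random

FAILURE_MODES = ["timeout", "rate_limit", "malformed_response", "auth_error"]

def inject_failure(trajectory, recover):
    # Corrupt one tool step, then append a demonstrated recovery so the
    # model trains on (failure -> recovery) rather than clean runs only.
    step = random.randrange(len(trajectory))
    failed_step = dict(trajectory[step], error=random.choice(FAILURE_MODES))
    prefix = trajectory[:step] + [failed_step]
    return prefix + recover(prefix)   # recovery-annotated trajectory
```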
arXiv Detail & Related papers (2025-09-25T10:37:30Z)
- OR-Toolformer: Modeling and Solving Operations Research Problems with Tool Augmented Large Language Models [3.7202906625021934]
Large language models (LLMs) demonstrate strong mathematical reasoning. We introduce OR-Toolformer, which fine-tunes Llama-3.1-8B-Instruct with a semi-automatic data synthesis pipeline. On three of four standard benchmarks, OR-Toolformer achieves up to 80.1% execution accuracy.
arXiv Detail & Related papers (2025-09-24T14:42:40Z)
- Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning [63.2198957755528]
We propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback.
arXiv Detail & Related papers (2025-06-05T04:35:49Z)
- OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents [8.441638148384389]
We introduce OptimAI, a framework for solving optimization problems described in natural language. Our framework is built upon the following key roles: formulator, planner, coder, and code critic. Our approach attains 88.1% accuracy on the NLP4LP dataset and 82.3% on the Optibench dataset, reducing error rates by 58% and 52%, respectively, over prior best results.
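The role decomposition suggests a pipeline like this sketch; the prompts, the `llm` callable, and the fixed repair budget are assumptions for illustration, not OptimAI internals.

```python
def optimai_style_solve(problem_text, llm, max_repairs=3):
    # formulator -> planner -> coder -> code critic, as named above.
    model = llm("Formulate as a formal optimization model:\n" + problem_text)
    plan = llm("Choose a solution strategy for this model:\n" + model)
    code = llm("Write solver code implementing this plan:\n" + plan)
    for _ in range(max_repairs):
        critique = llm("Critique this solver code; reply OK if sound:\n" + code)
        if critique.strip() == "OK":
            break
        code = llm("Repair the code per this critique:\n" + critique + "\n" + code)
    return code
```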
arXiv Detail & Related papers (2025-04-23T17:45:05Z)
- Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls. Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
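One plausible reading of such a reward, as a sketch: gate on final correctness and charge a small per-call cost; the penalty value is an illustrative choice, not the paper's.

```python
def efficiency_reward(answer_correct: bool, n_tool_calls: int,
                      per_call_penalty: float = 0.1) -> float:
    # Wrong answers earn nothing, so the per-call penalty cannot be
    # gamed by skipping genuinely necessary tool calls.
    if not answer_correct:
        return 0.0
    return max(1.0 - per_call_penalty * n_tool_calls, 0.0)
```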
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
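A minimal sketch of a tool-augmented reward model in this spirit; the tool interface and prompt format are assumptions, not Themis internals.

```python
def tool_augmented_score(question, answer, reward_model, tools):
    # Gather external evidence (e.g. search, calculator), then condition
    # the scalar preference score on it rather than on parametric memory alone.
    evidence = [tool(question) for tool in tools]
    prompt = (f"Question: {question}\n"
              f"Evidence: {evidence}\n"
              f"Answer: {answer}")
    return reward_model(prompt)
```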
arXiv Detail & Related papers (2023-10-02T09:47:40Z)