One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
- URL: http://arxiv.org/abs/2510.26167v1
- Date: Thu, 30 Oct 2025 06:08:27 GMT
- Title: One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
- Authors: Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang,
- Abstract summary: Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
- Score: 54.580646706013965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
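The abstract notes that ToolRM generalizes beyond pairwise reward judgments to broader critique tasks such as Best-of-N sampling. As a minimal sketch only (the function names `sample_tool_calls` and `score_with_rm` and the scalar-score interface are illustrative assumptions, not the paper's actual API), Best-of-N selection with a reward model amounts to scoring each sampled candidate tool call and keeping the highest-scoring one:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    tool_call: str      # candidate function call emitted by the policy model
    score: float = 0.0  # scalar preference assigned by the reward model

def best_of_n(
    prompt: str,
    sample_tool_calls: Callable[[str, int], List[str]],  # hypothetical policy sampler
    score_with_rm: Callable[[str, str], float],          # hypothetical reward-model scorer
    n: int = 8,
) -> Candidate:
    """Sample n candidate tool calls for a prompt and keep the one the RM prefers."""
    candidates = [Candidate(tool_call=c) for c in sample_tool_calls(prompt, n)]
    for cand in candidates:
        cand.score = score_with_rm(prompt, cand.tool_call)
    return max(candidates, key=lambda c: c.score)
```

In practice a generative RM like ToolRM would emit a critique and a verdict rather than a bare scalar; the scalar interface above only illustrates how inference-time scaling via Best-of-N reuses the same reward signal used for pairwise judgments.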
Related papers
- Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control [82.30868101940068]
We propose a paradigm in which a model bootstraps its own performance without reliance on external data or teacher models. Our theoretical analysis shows that RSIR acts as a data-driven implicit regularizer, smoothing the optimization landscape. We show that even smaller models benefit, and weak models can generate effective training curricula for stronger ones.
arXiv Detail & Related papers (2026-02-17T15:31:32Z) - D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use [17.99381644283042]
Large reasoning models (LRMs) lack the capability of sub-task decomposition in complex tool-use scenarios, leading to Lazy Reasoning. We propose a two-stage training framework that incentivizes LRMs' task decomposition reasoning capability via self-distillation and diversity-aware reinforcement learning. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales.
arXiv Detail & Related papers (2026-02-02T14:36:15Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning [41.49024599460379]
Reward models (RMs) have become essential for aligning large language models (LLMs). We introduce OpenRM, a tool-augmented long-form reward model that judges open-ended responses by invoking external tools to gather relevant evidence. Experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches.
arXiv Detail & Related papers (2025-10-28T17:02:46Z) - Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning. We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT starting point; (ii) exploration-friendly techniques, such as clip-higher, overlong reward shaping, and maintaining adequate policy entropy, are crucial for agentic RL and improve training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z) - Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning [68.89572566071575]
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. We propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light.
arXiv Detail & Related papers (2025-09-27T12:53:37Z) - ToolRM: Outcome Reward Models for Tool-Calling Large Language Models [18.60378078755052]
We introduce FC-RewardBench, the first benchmark designed to assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks.
arXiv Detail & Related papers (2025-09-15T14:17:17Z) - ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [0.0]
We introduce ToolComp, a benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators. We generate synthetic training data to compare the performance of outcome-supervised reward models with process-supervised reward models.
arXiv Detail & Related papers (2025-01-02T15:10:52Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)