Tool-Augmented Reward Modeling
- URL: http://arxiv.org/abs/2310.01045v2
- Date: Sun, 11 Feb 2024 16:58:02 GMT
- Title: Tool-Augmented Reward Modeling
- Authors: Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu
- Abstract summary: We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
- Score: 58.381678612409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward modeling (a.k.a., preference modeling) is instrumental for aligning
large language models with human preferences, particularly within the context
of reinforcement learning from human feedback (RLHF). While conventional reward
models (RMs) have exhibited remarkable scalability, they often struggle with
fundamental functionality such as arithmetic computation, code execution, and
factual lookup. In this paper, we propose a tool-augmented preference modeling
approach, named Themis, to address these limitations by empowering RMs with
access to external environments, including calculators and search engines. This
approach not only fosters synergy between tool utilization and reward grading
but also enhances interpretive capacity and scoring reliability. Our study
delves into the integration of external tools into RMs, enabling them to
interact with diverse external sources and construct task-specific tool
engagement and reasoning traces in an autoregressive manner. We validate our
approach across a wide range of domains, incorporating seven distinct external
tools. Our experimental results demonstrate a noteworthy overall improvement of
17.7% across eight tasks in preference ranking. Furthermore, our approach
outperforms Gopher 280B by 7.3% on the TruthfulQA task in zero-shot evaluation. In
human evaluations, RLHF trained with Themis attains an average win rate of 32%
when compared to baselines across four distinct tasks. Additionally, we provide
a comprehensive collection of tool-related RM datasets, incorporating data from
seven distinct tool APIs, totaling 15,000 instances. We have made the code,
data, and model checkpoints publicly available to facilitate and inspire
further research advancements (https://github.com/ernie-research/Tool-Augmented-Reward-Model).
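To make the grading flow concrete, below is a minimal, hypothetical sketch of how a tool-augmented reward model might interleave a tool call, its observation, and a reasoning trace before emitting a scalar reward. This is not the released Themis implementation; the tool set, trace format, helper names, and scoring rule are all illustrative assumptions.

```python
# Hypothetical sketch of a tool-augmented reward-grading flow (Themis-style).
# All components are simplified stand-ins, not the released implementation.
from dataclasses import dataclass

# --- stand-in "tools" the reward model is allowed to invoke ----------------
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression (stand-in for a calculator tool)."""
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as exc:
        return f"error: {exc}"

def search(query: str) -> str:
    """Stand-in for a search-engine tool; a real system would call an API."""
    return f"[top snippet for: {query}]"

TOOLS = {"Calculator": calculator, "Search": search}

@dataclass
class ToolTrace:
    tool: str
    tool_input: str
    observation: str
    rationale: str

def score_response(question: str, response: str) -> tuple[float, ToolTrace]:
    """Grade a candidate response, grounding the judgment in a tool observation."""
    # 1. Pick a tool and its input; a trained RM would generate this call
    #    autoregressively as part of its trace.
    if any(ch.isdigit() for ch in question):
        tool = "Calculator"
        tool_input = question.rstrip("?").split("is", 1)[-1].strip()
    else:
        tool, tool_input = "Search", question

    # 2. Execute the tool and record the observation.
    observation = TOOLS[tool](tool_input)

    # 3. Reason over the observation, then emit a scalar reward.
    consistent = observation in response
    rationale = (f"Tool {tool} returned '{observation}'; "
                 f"response {'matches' if consistent else 'conflicts with'} it.")
    reward = 1.0 if consistent else -1.0  # a real RM regresses a scalar from hidden states
    return reward, ToolTrace(tool, tool_input, observation, rationale)

reward, trace = score_response("What is 2 * (3 + 4)?", "The answer is 14.")
print(reward, trace.rationale)
```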
Related papers
- Data-Efficient Massive Tool Retrieval: A Reinforcement Learning Approach for Query-Tool Alignment with Language Models [28.67532617021655]
Large language models (LLMs) integrated with external tools and APIs have successfully addressed complex tasks by using in-context learning or fine-tuning.
Despite this progress, the vast scale of tool retrieval remains challenging due to stringent input length constraints.
We propose a pre-retrieval strategy from an extensive repository, effectively framing the problem as the massive tool retrieval (MTR) task.
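A minimal sketch of what such a pre-retrieval shortlist could look like, assuming a generic embed-and-rank top-k stage so that only a few candidate tools need to fit in the LLM context window; the toy bag-of-words encoder is a stand-in, not the paper's method.

```python
# Hedged sketch: shortlist tools from a large repository before prompting the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a trained text encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_tools(query: str, tool_descriptions: dict[str, str], k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(tool_descriptions,
                    key=lambda name: cosine(q, embed(tool_descriptions[name])),
                    reverse=True)
    return ranked[:k]

tools = {
    "calculator": "evaluate arithmetic expressions and equations",
    "weather": "get the current weather forecast for a city",
    "translator": "translate text between languages",
    "stock": "look up current stock prices",
}
print(retrieve_tools("what is the forecast for Paris tomorrow", tools, k=2))
```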
arXiv Detail & Related papers (2024-10-04T07:58:05Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
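For reference, a minimal sketch of the standard DPO objective applied to such step-level preference pairs; this is the generic formulation, not the paper's exact implementation, and the toy log-probabilities are placeholders.

```python
# Hedged sketch of the DPO loss over step-level preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: -log sigmoid(beta * (policy margin - reference margin))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch: summed log-probabilities of preferred vs. dispreferred reasoning steps
# under the trained policy and a frozen reference model (values are illustrative).
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-4.2, -3.8]),
    policy_logp_rejected=torch.tensor([-5.0, -4.9]),
    ref_logp_chosen=torch.tensor([-4.5, -4.0]),
    ref_logp_rejected=torch.tensor([-4.8, -4.7]),
)
print(float(loss))
```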
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z) - ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [170.7899683843177]
ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems.
ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales.
ToRA-Code-34B is the first open-source model that achieves an accuracy exceeding 50% on MATH.
arXiv Detail & Related papers (2023-09-29T17:59:38Z) - BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models [41.45240621979654]
We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)