Tool-Augmented Reward Modeling
- URL: http://arxiv.org/abs/2310.01045v2
- Date: Sun, 11 Feb 2024 16:58:02 GMT
- Title: Tool-Augmented Reward Modeling
- Authors: Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu
- Abstract summary: We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
- Score: 58.381678612409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward modeling (a.k.a., preference modeling) is instrumental for aligning
large language models with human preferences, particularly within the context
of reinforcement learning from human feedback (RLHF). While conventional reward
models (RMs) have exhibited remarkable scalability, they often struggle with
fundamental functionality such as arithmetic computation, code execution, and
factual lookup. In this paper, we propose a tool-augmented preference modeling
approach, named Themis, to address these limitations by empowering RMs with
access to external environments, including calculators and search engines. This
approach not only fosters synergy between tool utilization and reward grading
but also enhances interpretive capacity and scoring reliability. Our study
delves into the integration of external tools into RMs, enabling them to
interact with diverse external sources and construct task-specific tool
engagement and reasoning traces in an autoregressive manner. We validate our
approach across a wide range of domains, incorporating seven distinct external
tools. Our experimental results demonstrate a noteworthy overall improvement of
17.7% across eight tasks in preference ranking. Furthermore, our approach
outperforms Gopher 280B by 7.3% on the TruthfulQA task in zero-shot evaluation. In
human evaluations, RLHF trained with Themis attains an average win rate of 32%
when compared to baselines across four distinct tasks. Additionally, we provide
a comprehensive collection of tool-related RM datasets, incorporating data from
seven distinct tool APIs, totaling 15,000 instances. We have made the code,
data, and model checkpoints publicly available to facilitate and inspire
further research advancements (https://github.com/ernie-research/Tool-Augmented-Reward-Model).
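To make the grading flow concrete, below is a minimal, hypothetical sketch of how a tool-augmented reward model might interleave a tool call, its observation, and a reasoning trace before emitting a scalar reward. This is not the released Themis implementation; the tool set, trace format, helper names, and scoring rule are all illustrative assumptions.

```python
# Hypothetical sketch of a tool-augmented reward-grading flow (Themis-style).
# All components are simplified stand-ins, not the released implementation.
from dataclasses import dataclass

# --- stand-in "tools" the reward model is allowed to invoke ----------------
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression (stand-in for a calculator tool)."""
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as exc:
        return f"error: {exc}"

def search(query: str) -> str:
    """Stand-in for a search-engine tool; a real system would call an API."""
    return f"[top snippet for: {query}]"

TOOLS = {"Calculator": calculator, "Search": search}

@dataclass
class ToolTrace:
    tool: str
    tool_input: str
    observation: str
    rationale: str

def score_response(question: str, response: str) -> tuple[float, ToolTrace]:
    """Grade a candidate response, grounding the judgment in a tool observation."""
    # 1. Pick a tool and its input; a trained RM would generate this call
    #    autoregressively as part of its trace.
    if any(ch.isdigit() for ch in question):
        tool = "Calculator"
        tool_input = question.rstrip("?").split("is", 1)[-1].strip()
    else:
        tool, tool_input = "Search", question

    # 2. Execute the tool and record the observation.
    observation = TOOLS[tool](tool_input)

    # 3. Reason over the observation, then emit a scalar reward.
    consistent = observation in response
    rationale = (f"Tool {tool} returned '{observation}'; "
                 f"response {'matches' if consistent else 'conflicts with'} it.")
    reward = 1.0 if consistent else -1.0  # a real RM regresses a scalar from hidden states
    return reward, ToolTrace(tool, tool_input, observation, rationale)

reward, trace = score_response("What is 2 * (3 + 4)?", "The answer is 14.")
print(reward, trace.rationale)
```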
Related papers
- Data-Efficient Massive Tool Retrieval: A Reinforcement Learning Approach for Query-Tool Alignment with Language Models [28.67532617021655]
Large language models (LLMs) integrated with external tools and APIs have successfully addressed complex tasks by using in-context learning or fine-tuning.
Despite this progress, the vast scale of tool retrieval remains challenging due to stringent input length constraints.
We propose a pre-retrieval strategy from an extensive repository, effectively framing the problem as the massive tool retrieval (MTR) task.
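A minimal sketch of what such a pre-retrieval shortlist could look like, assuming a generic embed-and-rank top-k stage so that only a few candidate tools need to fit in the LLM context window; the toy bag-of-words encoder is a stand-in, not the paper's method.

```python
# Hedged sketch: shortlist tools from a large repository before prompting the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a trained text encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_tools(query: str, tool_descriptions: dict[str, str], k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(tool_descriptions,
                    key=lambda name: cosine(q, embed(tool_descriptions[name])),
                    reverse=True)
    return ranked[:k]

tools = {
    "calculator": "evaluate arithmetic expressions and equations",
    "weather": "get the current weather forecast for a city",
    "translator": "translate text between languages",
    "stock": "look up current stock prices",
}
print(retrieve_tools("what is the forecast for Paris tomorrow", tools, k=2))
```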
arXiv Detail & Related papers (2024-10-04T07:58:05Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
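For reference, a minimal sketch of the standard DPO objective applied to such step-level preference pairs; this is the generic formulation, not the paper's exact implementation, and the toy log-probabilities are placeholders.

```python
# Hedged sketch of the DPO loss over step-level preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: -log sigmoid(beta * (policy margin - reference margin))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch: summed log-probabilities of preferred vs. dispreferred reasoning steps
# under the trained policy and a frozen reference model (values are illustrative).
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-4.2, -3.8]),
    policy_logp_rejected=torch.tensor([-5.0, -4.9]),
    ref_logp_chosen=torch.tensor([-4.5, -4.0]),
    ref_logp_rejected=torch.tensor([-4.8, -4.7]),
)
print(float(loss))
```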
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z) - ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [170.7899683843177]
ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems.
ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales.
ToRA-Code-34B is the first open-source model that achieves an accuracy exceeding 50% on MATH.
arXiv Detail & Related papers (2023-09-29T17:59:38Z) - BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models [41.45240621979654]
We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)