Sparse Rewards Can Self-Train Dialogue Agents
- URL: http://arxiv.org/abs/2409.04617v2
- Date: Tue, 8 Oct 2024 20:40:59 GMT
- Title: Sparse Rewards Can Self-Train Dialogue Agents
- Authors: Barrett Martin Lattimer, Varun Gangal, Ryan McDonald, Yi Yang,
- Abstract summary: We introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback.
We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ.
We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks.
- Score: 22.799506097310008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub at https://github.com/asappresearch/josh-llm-simulation-training
Related papers
- GenSim: A General Social Simulation Platform with Large Language Model based Agents [111.00666003559324]
We propose a novel large language model (LLMs)-based simulation platform called textitGenSim.
Our platform supports one hundred thousand agents to better simulate large-scale populations in real-world contexts.
To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform.
arXiv Detail & Related papers (2024-10-06T05:02:23Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z) - LLMArena: Assessing Capabilities of Large Language Models in Dynamic
Multi-Agent Environments [35.926581910260076]
We introduce LLMArena, a framework for evaluating the capabilities of large language models in multi-agent dynamic environments.
LLArena employs Trueskill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration.
We conduct an extensive experiment and human evaluation among different sizes and types of LLMs, showing that LLMs still have a significant journey ahead in their development towards becoming fully autonomous agents.
arXiv Detail & Related papers (2024-02-26T11:31:48Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z) - Unlocking the Potential of User Feedback: Leveraging Large Language
Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller task-oriented dialogue model.
This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z) - Aligning Large Language Models through Synthetic Feedback [43.84431341195111]
We propose a novel alignment learning framework with synthetic feedback not dependent on extensive human annotations.
In human evaluation, our model is preferred to Alpaca and Dolly-v2, 55.0% and 58.5% of the time, respectively.
arXiv Detail & Related papers (2023-05-23T06:41:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.