ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL
- URL: http://arxiv.org/abs/2402.19446v1
- Date: Thu, 29 Feb 2024 18:45:56 GMT
- Title: ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL
- Authors: Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar
- Abstract summary: We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
- Score: 80.10358123795946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A broad use case of large language models (LLMs) is in goal-directed
decision-making tasks (or "agent" tasks), where an LLM needs to not just
generate completions for a given prompt, but rather make intelligent decisions
over a multi-turn interaction to accomplish a task (e.g., when interacting with
the web, using tools, or providing customer support). Reinforcement learning
(RL) provides a general paradigm to address such agent tasks, but current RL
methods for LLMs largely focus on optimizing single-turn rewards. By
construction, most single-turn RL methods cannot endow LLMs with the ability to
intelligently seek information over multiple turns, perform credit assignment,
or reason about their past actions -- all of which are critical in agent tasks.
This raises the question: how can we design effective and efficient multi-turn
RL algorithms for LLMs? In this paper, we develop a framework for building
multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility
of existing single-turn RL methods for LLMs (e.g., proximal policy
optimization), while accommodating multiple turns, long horizons, and delayed
rewards effectively. To do this, our framework adopts a hierarchical RL
approach and runs two RL algorithms in parallel: a high-level off-policy
value-based RL algorithm to aggregate reward over utterances, and a low-level
RL algorithm that utilizes this high-level value function to train a token
policy within each utterance or turn. Our hierarchical framework, Actor-Critic
Framework with a Hierarchical Structure (ArCHer), can also give rise to other
RL methods. Empirically, we find that ArCHer significantly improves efficiency
and performance on agent tasks, attaining a sample efficiency of about 100x
over existing methods, while also improving with larger model capacity (up to
the 7 billion scale that we tested on).
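The two-level design described in the abstract can be pictured as two coupled updates. Below is a minimal sketch (not the authors' released code), assuming PyTorch and toy placeholder modules for the critic and token policy: an off-policy TD update for a turn-level critic over utterance embeddings, and a token-level policy-gradient step that weights every token in the utterance by the critic's value. All names, shapes, and the exact weighting scheme are illustrative assumptions.

```python
# Illustrative sketch of a hierarchical actor-critic update in the spirit of
# ArCHer's abstract; module names, shapes, and losses are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64     # hypothetical embedding size for states / utterances
VOCAB = 100  # hypothetical token vocabulary size
GAMMA = 0.99 # discount applied per turn, not per token

# High-level critic: Q(state, utterance) aggregates reward at the turn level.
critic = nn.Sequential(nn.Linear(2 * EMB, 128), nn.ReLU(), nn.Linear(128, 1))
# Low-level token policy: next-token logits given a context embedding.
policy = nn.Sequential(nn.Linear(EMB, 128), nn.ReLU(), nn.Linear(128, VOCAB))
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def critic_update(s, a, r, s_next, a_next):
    """Off-policy TD update where one whole utterance is one action."""
    q = critic(torch.cat([s, a], dim=-1)).squeeze(-1)
    with torch.no_grad():
        target = r + GAMMA * critic(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
    loss = F.mse_loss(q, target)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()

def policy_update(contexts, tokens, s, a):
    """Token-level policy gradient: each token in the utterance is
    reinforced by the turn-level value from the high-level critic."""
    with torch.no_grad():
        weight = critic(torch.cat([s, a], dim=-1)).squeeze(-1)   # (1,)
    logp = F.log_softmax(policy(contexts), dim=-1)               # (T, VOCAB)
    chosen = logp.gather(1, tokens.unsqueeze(-1)).squeeze(-1)    # (T,)
    loss = -(weight * chosen).mean()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()

# Toy usage with random placeholder data for one turn of T tokens.
T = 5
s, a, s_next, a_next = (torch.randn(1, EMB) for _ in range(4))
critic_update(s, a, r=torch.tensor([1.0]), s_next=s_next, a_next=a_next)
policy_update(torch.randn(T, EMB), torch.randint(VOCAB, (T,)), s, a)
```

The paper's actual algorithm is more involved (e.g., replay-buffer training and more careful advantage estimation than the raw value used here); the sketch only fixes the two parallel updates the abstract describes: aggregate reward over utterances with a value-based learner, then train the token policy against that value within each turn.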
Related papers
- RL-GPT: Integrating Reinforcement Learning and Code-as-policy [82.1804241891039]
We introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent.
The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks.
This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline.
arXiv Detail & Related papers (2024-02-29T16:07:22Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study [1.3597551064547502]
We employ a teacher-student learning framework to tackle problems of Large Language Models (LLMs) and reinforcement learning (RL) models.
Within this framework, the LLM acts as a teacher, while the RL model acts as a student.
We propose a practical algorithm to address the problem and conduct empirical experiments to evaluate the effectiveness of our method.
arXiv Detail & Related papers (2024-01-12T14:35:57Z)
- LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text.
This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion.
We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)
- RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$ [12.111848705677142]
We propose RL$^3$, a hybrid approach that incorporates action-values, learned per task through traditional RL, in the inputs to meta-RL.
We show that RL$^3$ earns greater cumulative reward in the long term, compared to RL$^2$, while maintaining data-efficiency in the short term, and generalizes better to out-of-distribution tasks.
arXiv Detail & Related papers (2023-06-28T04:16:16Z)
- Train Hard, Fight Easy: Robust Meta Reinforcement Learning [78.16589993684698]
A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients.
Standard meta-RL (MRL) methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty.
In this work, we define a robust MRL objective with a controlled robustness level.
The resulting data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML).
arXiv Detail & Related papers (2023-01-26T14:54:39Z)
- A Survey of Meta-Reinforcement Learning [69.76165430793571]
We cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL.
We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task.
We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
arXiv Detail & Related papers (2023-01-19T12:01:41Z)
- FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization [10.243908145832394]
We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks.
This problem is still not fully understood, and two major challenges need to be addressed.
We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches.
arXiv Detail & Related papers (2020-10-02T17:13:39Z)