Recursive Introspection: Teaching Language Model Agents How to Self-Improve
- URL: http://arxiv.org/abs/2407.18219v1
- Date: Thu, 25 Jul 2024 17:35:59 GMT
- Title: Recursive Introspection: Teaching Language Model Agents How to Self-Improve
- Authors: Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar,
- Abstract summary: We develop RISE: Recursive IntroSpEction, an approach for fine-tuning large language models.
Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks.
- Score: 30.086494067593268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A central piece in enabling intelligent agentic behavior in foundation models is to make them capable of introspecting upon their behavior, reasoning, and correcting their mistakes as more computation or interaction is available. Even the strongest proprietary large language models (LLMs) do not quite exhibit the ability of continually improving their responses sequentially, even in scenarios where they are explicitly told that they are making a mistake. In this paper, we develop RISE: Recursive IntroSpEction, an approach for fine-tuning LLMs to introduce this capability, despite prior work hypothesizing that this capability may not be possible to attain. Our approach prescribes an iterative fine-tuning procedure, which attempts to teach the model how to alter its response after having executed previously unsuccessful attempts to solve a hard test-time problem, with optionally additional environment feedback. RISE poses fine-tuning for a single-turn prompt as solving a multi-turn Markov decision process (MDP), where the initial state is the prompt. Inspired by principles in online imitation learning and reinforcement learning, we propose strategies for multi-turn data collection and training so as to imbue an LLM with the capability to recursively detect and correct its previous mistakes in subsequent iterations. Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation. We also find that RISE scales well, often attaining larger benefits with more capable models. Our analysis shows that RISE makes meaningful improvements to responses to arrive at the correct solution for challenging prompts, without disrupting one-turn abilities as a result of expressing more complex distributions.
Related papers
- LLMs can learn self-restraint through iterative self-reflection [57.26854891567574]
Large Language Models (LLMs) must be capable of dynamically adapting their behavior based on their level of knowledge and uncertainty associated with specific topics.
This adaptive behavior, which we refer to as self-restraint, is non-trivial to teach.
We devise a utility function that can encourage the model to produce responses only when it is confident in them.
arXiv Detail & Related papers (2024-05-15T13:35:43Z) - AlphaMath Almost Zero: process Supervision without process [6.318873143509028]
Large language models (LLMs) struggle with complex problems that require multiple reasoning steps.
We introduce an innovative approach that bypasses the need for process annotations (from human or GPTs) by utilizing the Monte Carlo Tree Search (MCTS) framework.
Our method iteratively trains the policy and value models, leveraging the capabilities of a well-pretrained LLM.
arXiv Detail & Related papers (2024-05-06T15:20:30Z) - Reframing Offline Reinforcement Learning as a Regression Problem [0.0]
The study proposes the reformulation of offline reinforcement learning as a regression problem that can be solved with decision trees.
We observe that with gradient-boosted trees, the agent training and inference are very fast.
Despite the simplification inherent in this reformulated problem, our agent demonstrates performance that is at least on par with established methods.
arXiv Detail & Related papers (2024-01-21T23:50:46Z) - ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent [50.508669199496474]
We develop a ReAct-style LLM agent with the ability to reason and act upon external knowledge.
We refine the agent through a ReST-like method that iteratively trains on previous trajectories.
Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model.
arXiv Detail & Related papers (2023-12-15T18:20:15Z) - R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning)
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z) - LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient
Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text.
This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion.
We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Sequential Transfer in Reinforcement Learning with a Generative Model [48.40219742217783]
We show how to reduce the sample complexity for learning new tasks by transferring knowledge from previously-solved ones.
We derive PAC bounds on its sample complexity which clearly demonstrate the benefits of using this kind of prior knowledge.
We empirically verify our theoretical findings in simple simulated domains.
arXiv Detail & Related papers (2020-07-01T19:53:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.