Related papers: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

URL: http://arxiv.org/abs/2402.11651v2
Date: Tue, 16 Apr 2024 11:41:13 GMT
Title: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
Authors: Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin,
Abstract summary: Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. Previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models. We argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies.
Score: 41.14201835950814
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine-tuning data scarce and acquiring it both difficult and costly. Discarding failed trajectories also leads to significant wastage of data and resources and limits the possible optimization paths during fine-tuning. In this paper, we argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies. By simply adding a prefix or suffix that tells the model whether to generate a successful trajectory during training, we improve model performance by a large margin on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. We further analyze the inference results and find that our method provides a better trade-off between valuable information and errors in unsuccessful trajectories. To our knowledge, we are the first to demonstrate the value of negative trajectories and their application in agent-tunning scenarios. Our findings offer guidance for developing better agent-tuning methods and low-resource data usage techniques.

Related papers

STeCa: Step-level Trajectory Calibration for LLM Agent Learning [9.678098205818566]
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. We propose Step-Level Trajectory (STeCa), a novel framework for LLM agent learning.
arXiv Detail & Related papers (2025-02-20T05:28:44Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Training Agents with Weakly Supervised Feedback from Large Language Models [19.216542820742607]
This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM. Our agents are trained in iterative manner, where they initially generate trajectories through environmental interaction. Tests on the API-bank dataset show consistent improvement in our agents' capabilities and comparable performance to GPT-4.
arXiv Detail & Related papers (2024-11-29T08:47:04Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language models (LLMs) unlearning via gradient ascent (GA) Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning. We propose several controlling methods that can regulate the extent of excessive unlearning.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents [49.85633804913796]
We present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Our experiments on three complex tasks demonstrate that ETO consistently surpasses baseline performance by a large margin.
arXiv Detail & Related papers (2024-03-04T21:50:29Z)
Pedagogical Alignment of Large Language Models [24.427653091950994]
Large Language Models (LLMs) provide immediate answers rather than guiding students through the problem-solving process. This paper investigates Learning from Human Preferences (LHP) algorithms to achieve this alignment objective.
arXiv Detail & Related papers (2024-02-07T16:15:59Z)
Unlearnable Algorithms for In-context Learning [36.895152458323764]
In this paper, we focus on efficient unlearning methods for the task adaptation phase of a pretrained large language model. We observe that an LLM's ability to do in-context learning for task adaptation allows for efficient exact unlearning of task adaptation training data. We propose a new holistic measure of unlearning cost which accounts for varying inference costs.
arXiv Detail & Related papers (2024-02-01T16:43:04Z)
LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently. We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction. With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z)
Contrastive Trajectory Similarity Learning with Dual-Feature Attention [24.445998309807965]
Tray similarity measures act as query predicates in trajectory databases. We propose a contrastive learning-based trajectory modelling method named TrajCL. TrajCL is consistently and significantly more accurate and faster than the state-of-the-art trajectory similarity measures.
arXiv Detail & Related papers (2022-10-11T05:25:14Z)
TRAIL: Near-Optimal Imitation Learning with Suboptimal Data [100.83688818427915]
We present training objectives that use offline datasets to learn a factored transition model. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model.
arXiv Detail & Related papers (2021-10-27T21:05:00Z)
Few-shot Quality-Diversity Optimization [50.337225556491774]
Quality-Diversity (QD) optimization has been shown to be effective tools in dealing with deceptive minima and sparse rewards in Reinforcement Learning. We show that, given examples from a task distribution, information about the paths taken by optimization in parameter space can be leveraged to build a prior population, which when used to initialize QD methods in unseen environments, allows for few-shot adaptation. Experiments carried in both sparse and dense reward settings using robotic manipulation and navigation benchmarks show that it considerably reduces the number of generations that are required for QD optimization in these environments.
arXiv Detail & Related papers (2021-09-14T17:12:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.