Related papers: Few-shot In-Context Preference Learning Using Large Language Models

URL: http://arxiv.org/abs/2410.17233v1
Date: Tue, 22 Oct 2024 17:53:34 GMT
Title: Few-shot In-Context Preference Learning Using Large Language Models
Authors: Chao Yu, Hong Lu, Jiaxuan Gao, Qixin Tan, Xinting Yang, Yu Wang, Yi Wu, Eugene Vinitsky
Abstract summary: Designing reward functions is a core component of reinforcement learning. It can be exceedingly inefficient to learn rewards as they are often learned tabula rasa. We propose In-Context Preference Learning (ICPL) to accelerate learning reward functions from preferences.
Score: 15.84585737510038
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Designing reward functions is a core component of reinforcement learning but can be challenging for truly complex behavior. Reinforcement Learning from Human Feedback (RLHF) has been used to alleviate this challenge by replacing a hand-coded reward function with a reward function learned from preferences. However, it can be exceedingly inefficient to learn these rewards as they are often learned tabula rasa. We investigate whether Large Language Models (LLMs) can reduce this query inefficiency by converting an iterative series of human preferences into code representing the rewards. We propose In-Context Preference Learning (ICPL), a method that uses the grounding of an LLM to accelerate learning reward functions from preferences. ICPL takes the environment context and task description, synthesizes a set of reward functions, and then repeatedly updates the reward functions using human rankings of videos of the resultant policies. Using synthetic preferences, we demonstrate that ICPL is orders of magnitude more efficient than RLHF and is even competitive with methods that use ground-truth reward functions instead of preferences. Finally, we perform a series of human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans-in-the-loop. Additional information and videos are provided at https://sites.google.com/view/few-shot-icpl/home.
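The loop described in the abstract (synthesize candidate reward functions, train a policy on each, have an evaluator pick the preferred behavior, feed that choice back) can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical: the LLM is replaced by a random perturbation of the preferred candidate, RL training by a closed-form maximizer, and the human ranking by a synthetic preference based on distance to a hidden target.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A toy 'reward function': r(x) = -(x - center)^2 (hypothetical form)."""
    center: float

    def reward(self, x: float) -> float:
        return -(x - self.center) ** 2


def propose(best: Optional[Candidate], k: int, scale: float,
            rng: random.Random) -> List[Candidate]:
    # Assumption: stands in for the LLM that writes reward code.
    if best is None:
        # First round: no feedback yet, so sample k initial guesses.
        return [Candidate(rng.uniform(-10.0, 10.0)) for _ in range(k)]
    # Later rounds: keep the preferred candidate and add k-1 variations of it.
    variations = [Candidate(best.center + rng.gauss(0.0, scale))
                  for _ in range(k - 1)]
    return [best] + variations


def train_policy(candidate: Candidate) -> float:
    # Assumption: stands in for RL training. The behavior that maximizes
    # -(x - center)^2 is simply x = center.
    return candidate.center


def preferred_index(behaviors: List[float], target: float) -> int:
    # Assumption: synthetic preference. The evaluator prefers the behavior
    # closest to a target that the proposer never sees directly.
    return min(range(len(behaviors)), key=lambda i: abs(behaviors[i] - target))


def icpl_sketch(target: float = 3.0, k: int = 6, iterations: int = 5,
                seed: int = 0):
    rng = random.Random(seed)
    best: Optional[Candidate] = None
    errors: List[float] = []
    scale = 2.0
    for _ in range(iterations):
        candidates = propose(best, k, scale, rng)
        behaviors = [train_policy(c) for c in candidates]
        best = candidates[preferred_index(behaviors, target)]
        errors.append(abs(train_policy(best) - target))
        scale *= 0.5  # narrow the search as feedback accumulates
    return best, errors


if __name__ == "__main__":
    best, errors = icpl_sketch()
    print("final center:", best.center)
    print("error per iteration:", errors)
```

Because the preferred candidate is carried into the next round, the error in this toy never increases from one iteration to the next. The only feedback used is which candidate was preferred, which mirrors the preference-only signal the abstract describes.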

Related papers

LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning [13.035613181550941]
Large Language Model-Assisted Preference Prediction (LAPP) is a novel framework for robot learning. LAPP enables efficient, customizable, and expressive behavior acquisition with minimum human effort. We show that LAPP achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors.
arXiv Detail & Related papers (2025-04-21T22:46:29Z)
Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning [16.654435148168172]
Large Language Models (LLMs) have shown remarkable promise in reasoning and decision-making. We propose an LLM-guided hierarchical RL framework, termed LDSC, to enhance sample efficiency, generalization, and multi-task adaptability.
arXiv Detail & Related papers (2025-03-24T15:49:56Z)
From Selection to Generation: A Survey of LLM-based Active Learning [153.8110509961261]
Large Language Models (LLMs) have been employed for generating entirely new data instances and providing more cost-effective annotations. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques.
arXiv Detail & Related papers (2025-02-17T12:58:17Z)
Real-Time Personalization for LLM-based Recommendation with Customized In-Context Learning [57.28766250993726]
This work explores adapting to dynamic user interests without any model updates. Existing Large Language Model (LLM)-based recommenders often lose the in-context learning ability during recommendation tuning. We propose RecICL, which customizes recommendation-specific in-context learning for real-time recommendations.
arXiv Detail & Related papers (2024-10-30T15:48:36Z)
Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs [12.572869123617783]
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks. Preference-based RL (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals. We propose an LLM-enabled automatic preference generation framework named LLM4PG.
arXiv Detail & Related papers (2024-06-28T04:21:24Z)
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment [11.639973274337274]
Large Language Models (LLMs) contain valuable task-related knowledge that can potentially aid in the learning of reward functions. We propose a method to learn rewards more efficiently in the absence of humans.
arXiv Detail & Related papers (2024-05-12T04:57:43Z)
Enhancing Q-Learning with Large Language Model Heuristics [0.0]
Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. We propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning.
arXiv Detail & Related papers (2024-05-06T10:42:28Z)
How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities. We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods to mitigate this misalignment work by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z)
Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL [62.824464372594576]
We aim to enhance arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization. We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z)
Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback? [10.968490626773564]
We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences.
arXiv Detail & Related papers (2023-06-22T16:04:16Z)
Inverse Preference Learning: Preference-based RL without a Reward Function [34.31087304327075]
Inverse Preference Learning (IPL) is specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the $Q$-function encodes all information about the reward function, effectively making them interchangeable. IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions.
arXiv Detail & Related papers (2023-05-24T17:14:10Z)
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs. Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms. Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward. Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). In this paper, we consider the problem of adaptively utilizing a given shaping reward function. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations. As an alternative, we present a preference-based learning approach in which human feedback takes the form only of comparisons between trajectories. Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.