Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models
- URL: http://arxiv.org/abs/2301.04741v2
- Date: Fri, 9 Feb 2024 20:44:22 GMT
- Title: Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models
- Authors: Yi Liu, Gaurav Datta, Ellen Novoseller, Daniel S. Brown
- Abstract summary: Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences.
We study the benefits and challenges of using a learned dynamics model when performing PbRL.
- Score: 13.077993395762185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based reinforcement learning (PbRL) can enable robots to learn to
perform tasks based on an individual's preferences without requiring a
hand-crafted reward function. However, existing approaches either assume access
to a high-fidelity simulator or an analytic model, or take a model-free approach
that requires extensive, and possibly unsafe, online environment interactions. In
this paper, we study the benefits and challenges of using a learned dynamics
model when performing PbRL. In particular, we provide evidence that a learned
dynamics model offers the following benefits when performing PbRL: (1)
preference elicitation and policy optimization require significantly fewer
environment interactions than model-free PbRL, (2) diverse preference queries
can be synthesized safely and efficiently as a byproduct of standard
model-based RL, and (3) reward pre-training based on suboptimal demonstrations
can be performed without any environmental interaction. Our paper provides
empirical evidence that learned dynamics models enable robots to learn
customized policies based on user preferences in ways that are safer and more
sample efficient than prior preference learning approaches. Supplementary
materials and code are available at
https://sites.google.com/berkeley.edu/mop-rl.
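To make the abstract's recipe concrete, here is a minimal sketch of its two core ingredients: Bradley-Terry reward learning over trajectory segments, and query synthesis via rollouts inside a learned dynamics model rather than the real environment. All names (`reward_net`, `dynamics_model`, `policy`) and dimensions are illustrative placeholders, not the authors' released code; see the project site above for the real implementation.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HORIZON = 8, 2, 50

# Illustrative stand-ins for the learned components (not the paper's code).
dynamics_model = nn.Linear(OBS_DIM + ACT_DIM, OBS_DIM)             # s' = f(s, a)
reward_net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Linear(OBS_DIM, ACT_DIM)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

@torch.no_grad()
def rollout_in_model(s0):
    """Roll a segment inside the learned dynamics model -- no real env steps."""
    s, states = s0, []
    for _ in range(HORIZON):
        a = policy(s) + 0.1 * torch.randn(ACT_DIM)  # action noise -> diverse queries
        s = dynamics_model(torch.cat([s, a]))
        states.append(s)
    return torch.stack(states)                      # (HORIZON, OBS_DIM)

def preference_loss(seg_a, seg_b, a_preferred: bool):
    """Bradley-Terry model: P(a > b) = exp(R_a) / (exp(R_a) + exp(R_b))."""
    returns = torch.stack([reward_net(seg_a).sum(), reward_net(seg_b).sum()])
    target = torch.tensor([0 if a_preferred else 1])
    return nn.functional.cross_entropy(returns.unsqueeze(0), target)

# One synthetic query: two model rollouts from the same start state. A real
# system would show these to the user; here the label is a stand-in.
s0 = torch.zeros(OBS_DIM)
seg_a, seg_b = rollout_in_model(s0), rollout_in_model(s0)
optimizer.zero_grad()
preference_loss(seg_a, seg_b, a_preferred=True).backward()
optimizer.step()
```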
Related papers
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment.
We propose Exploratory Preference Optimization (XPO), a new algorithm for online exploration in RLHF.
XPO enjoys the strongest known provable guarantees and promising empirical performance.
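For context, the sketch below shows the standard DPO loss that XPO starts from, with the exploration bonus indicated only schematically: `xpo_step_loss`, `alpha`, and `fresh_logp` are illustrative names, and the exact form and sign of the bonus follow the XPO paper rather than this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on one preference pair; inputs are response-level
    log-probabilities under the policy and the frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin)

def xpo_step_loss(pairs, fresh_logp, alpha=0.01, beta=0.1):
    """Schematic XPO-style objective: DPO plus an exploration term computed
    on fresh self-generated responses (placeholder, not the published formula)."""
    dpo = torch.stack([dpo_loss(*p, beta=beta) for p in pairs]).mean()
    return dpo + alpha * fresh_logp.mean()
```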
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for Direct Preference Optimization (DPO) to make better use of preference labels.
We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model.
We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
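A minimal sketch of an entropy-based acquisition rule of the kind described here, assuming access to per-token logits for each candidate completion; `select_queries` and its inputs are illustrative names, not the paper's code.

```python
import torch

def predictive_entropy(token_logits):
    """Mean per-token entropy of the predictive distribution over a
    completion; token_logits has shape (T, vocab_size)."""
    logp = torch.log_softmax(token_logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

def select_queries(candidate_logits, k):
    """Acquisition step: send the k prompt/completion pairs the model is
    most uncertain about (highest entropy) for preference labeling."""
    scores = torch.stack([predictive_entropy(l) for l in candidate_logits])
    return scores.topk(k).indices
```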
arXiv Detail & Related papers (2024-02-12T23:09:00Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
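For reference, a sketch of the standard pairwise reward-model objective this line of work builds on, plus a label-smoothed variant sometimes used to soften noisy or ambiguous preferences; the smoothing shown is a common illustration, not necessarily the paper's specific method.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_chosen, r_rejected):
    """Standard pairwise objective: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def smoothed_ranking_loss(r_chosen, r_rejected, eps=0.1):
    """Label-smoothed variant: treat a preference label as correct with
    probability 1 - eps, which damps the impact of mislabeled pairs."""
    p = torch.sigmoid(r_chosen - r_rejected)
    return -((1 - eps) * torch.log(p) + eps * torch.log(1 - p)).mean()
```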
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Data-Efficient Task Generalization via Probabilistic Model-based Meta Reinforcement Learning [58.575939354953526]
Existing meta-reinforcement learning (Meta-RL) methods require abundant meta-learning data, limiting their applicability in settings such as robotics.
PACOH-RL is a novel model-based Meta-RL algorithm designed to efficiently adapt control policies to changing dynamics.
Our experiment results demonstrate that PACOH-RL outperforms model-based RL and model-based Meta-RL baselines in adapting to new dynamic conditions.
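A generic sketch of the adaptation step this setting revolves around: fine-tune a learned dynamics model on a few transitions gathered under the new conditions before replanning. This illustrates the surrounding pattern only, not PACOH-RL's meta-learned prior update.

```python
import torch
import torch.nn as nn

def adapt_dynamics(model, transitions, steps=20, lr=1e-3):
    """Few-shot adaptation: regress next states on a small batch of
    (s, a, s_next) transitions from the changed environment."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        for s, a, s_next in transitions:
            loss = nn.functional.mse_loss(model(torch.cat([s, a])), s_next)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```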
arXiv Detail & Related papers (2023-11-13T18:51:57Z)
- CostNet: An End-to-End Framework for Goal-Directed Reinforcement Learning [9.432068833600884]
Reinforcement Learning (RL) is a general framework concerned with an agent that seeks to maximize rewards in an environment.
Two main approaches, model-based and model-free reinforcement learning, have shown concrete results in several disciplines.
This paper introduces a novel reinforcement learning algorithm for predicting the distance between two states in a Markov Decision Process.
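An illustrative state-pair distance estimator in the spirit of that summary (not CostNet's published architecture): embed both states jointly and regress the number of steps that separated them along observed trajectories.

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """Predicts an estimated step count between a state and a goal state."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))

def train_step(model, opt, s, g, steps_apart):
    """Supervised target: how many steps separated s and g in a trajectory."""
    loss = nn.functional.mse_loss(model(s, g).squeeze(-1), steps_apart)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```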
arXiv Detail & Related papers (2022-10-03T21:16:14Z)
- Physics-informed Dyna-Style Model-Based Deep Reinforcement Learning for Dynamic Control [1.8275108630751844]
We propose to leverage prior knowledge of the environment's underlying physics, where the governing laws are (partially) known.
By incorporating the prior information of the environment, the quality of the learned model can be notably improved.
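One common way to encode partially known governing laws is a residual model: an analytic physics prior plus a learned correction, as sketched below. This is an illustration of the general pattern, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """Learned model = known physics prior + neural residual correction."""
    def __init__(self, f_physics, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.f_physics = f_physics   # callable encoding the (partially) known laws
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim))

    def forward(self, s, a):
        return self.f_physics(s, a) + self.residual(torch.cat([s, a], dim=-1))
```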
arXiv Detail & Related papers (2021-07-31T02:19:36Z)
- Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference [71.11416263370823]
We propose a generative inverse reinforcement learning approach for user behavioral preference modelling.
Our model can automatically learn rewards from a user's actions based on a discriminative actor-critic network and a Wasserstein GAN.
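A minimal sketch of the Wasserstein-critic idea in adversarial reward learning: the critic is trained to score expert behavior above the policy's, and its score is then reused as a learned reward. Function names and the surrounding training loop are assumptions, not the paper's code.

```python
import torch

def wasserstein_critic_loss(critic, expert_sa, policy_sa):
    """Train the critic to maximize the score gap between expert and
    policy state-action batches (shown here as a loss to minimize)."""
    return -(critic(expert_sa).mean() - critic(policy_sa).mean())

def reward_from_critic(critic, sa):
    """Reuse the critic's score as the learned reward for policy updates."""
    with torch.no_grad():
        return critic(sa)
```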
arXiv Detail & Related papers (2021-05-03T13:14:25Z)
- Model Predictive Actor-Critic: Accelerating Robot Skill Acquisition with Deep Reinforcement Learning [42.525696463089794]
Model Predictive Actor-Critic (MoPAC) is a hybrid model-based/model-free method that combines model predictive rollouts with policy optimization to mitigate model bias.
MoPAC guarantees optimal skill learning up to an approximation error and reduces necessary physical interaction with the environment.
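A generic sketch of the hybrid pattern MoPAC belongs to: short imagined rollouts from a learned model augment the data used for model-free policy optimization, reducing real-world interaction. The rollout shown is illustrative; MoPAC's actual rollout/optimization schedule is in the paper.

```python
import torch

def imagined_rollouts(dynamics_model, policy, start_states, horizon=5):
    """Generate short model-based rollouts from real states; the resulting
    transitions augment the replay buffer for actor-critic updates."""
    transitions = []
    s = start_states
    for _ in range(horizon):
        a = policy(s)
        s_next = dynamics_model(torch.cat([s, a], dim=-1))
        transitions.append((s, a, s_next))
        s = s_next
    return transitions
```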
arXiv Detail & Related papers (2021-03-25T13:50:24Z)
- Information Theoretic Model Predictive Q-Learning [64.74041985237105]
We present a novel theoretical connection between information theoretic MPC and entropy regularized RL.
We develop a Q-learning algorithm that can leverage biased models.
arXiv Detail & Related papers (2019-12-31T00:29:22Z)
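For concreteness, an MPPI-style update of the kind used in information-theoretic MPC: perturb a nominal action sequence, weight the rollouts by exp(-cost / lambda), and average. `dynamics` and `cost` stand in for a (possibly biased) learned model and a running cost; this is a sketch of the general scheme, not the paper's algorithm.

```python
import numpy as np

def mppi_step(dynamics, cost, s0, u_nom, num_samples=64, lam=1.0, sigma=0.3):
    """One information-theoretic MPC update over a nominal plan u_nom
    of shape (horizon, act_dim); returns the reweighted plan."""
    horizon, act_dim = u_nom.shape
    noise = sigma * np.random.randn(num_samples, horizon, act_dim)
    costs = np.empty(num_samples)
    for k in range(num_samples):
        s, c = s0, 0.0
        for t in range(horizon):
            s = dynamics(s, u_nom[t] + noise[k, t])
            c += cost(s)
        costs[k] = c
    weights = np.exp(-(costs - costs.min()) / lam)   # softmin over rollouts
    weights /= weights.sum()
    return u_nom + np.einsum('k,kha->ha', weights, noise)
```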
This list is automatically generated from the titles and abstracts of the papers on this site.