An Experimental Design Perspective on Model-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2112.05244v1
- Date: Thu, 9 Dec 2021 23:13:57 GMT
- Title: An Experimental Design Perspective on Model-Based Reinforcement Learning
- Authors: Viraj Mehta and Biswajit Paria and Jeff Schneider and Stefano Ermon and Willie Neiswanger
- Abstract summary: In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
- Score: 73.37942845983417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many practical applications of RL, it is expensive to observe state
transitions from the environment. For example, in the problem of plasma control
for nuclear fusion, computing the next state for a given state-action pair
requires querying an expensive transition function which can lead to many hours
of computer simulation or dollars of scientific research. Such expensive data
collection prohibits application of standard RL algorithms which usually
require a large number of observations to learn. In this work, we address the
problem of efficiently learning a policy while making a minimal number of
state-action queries to the transition function. In particular, we leverage
ideas from Bayesian optimal experimental design to guide the selection of
state-action queries for efficient learning. We propose an acquisition function
that quantifies how much information a state-action pair would provide about
the optimal solution to a Markov decision process. At each iteration, our
algorithm maximizes this acquisition function, to choose the most informative
state-action pair to be queried, thus yielding a data-efficient RL approach. We
experiment with a variety of simulated continuous control problems and show
that our approach learns an optimal policy with up to $5$ -- $1,000\times$ less
data than model-based RL baselines and $10^3$ -- $10^5\times$ less data than
model-free RL baselines. We also provide several ablated comparisons which
point to substantial improvements arising from the principled method of
obtaining data.
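As a rough illustration of the loop described in the abstract, the sketch below scores candidate state-action queries by an expected-information-gain proxy computed under a Gaussian-process dynamics model and returns the maximizer. It is a minimal sketch under simplifying assumptions: the RBF kernel, the scalar predictive variance, and the way the `optimal_traj_samples` are produced (e.g., by planning under posterior samples of the dynamics) are illustrative choices, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of choosing the state-action query
# that is most informative about the MDP's optimal solution. Assumes a GP
# dynamics model with an RBF kernel; `optimal_traj_samples` are assumed to be
# optimal trajectories computed under posterior samples of the dynamics.
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predictive_variance(X_train, x_query, noise=1e-3):
    """GP predictive variance at x_query; for a GP it depends only on inputs."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    k_star = rbf_kernel(X_train, x_query[None, :])
    k_qq = rbf_kernel(x_query[None, :], x_query[None, :])
    var = k_qq - k_star.T @ np.linalg.solve(K, k_star)
    return var.item() + noise

def expected_information_gain(x_candidate, X_data, optimal_traj_samples):
    """H[s' | D] minus the average of H[s' | D plus a sampled optimal
    trajectory], using the Gaussian entropy 0.5*log(2*pi*e*var)."""
    h_prior = 0.5 * np.log(2 * np.pi * np.e * gp_predictive_variance(X_data, x_candidate))
    h_cond = []
    for tau_star in optimal_traj_samples:           # tau*: array of (s, a) rows
        var = gp_predictive_variance(np.vstack([X_data, tau_star]), x_candidate)
        h_cond.append(0.5 * np.log(2 * np.pi * np.e * var))
    return h_prior - np.mean(h_cond)

def choose_next_query(candidates, X_data, optimal_traj_samples):
    """One iteration: pick the candidate maximizing the acquisition."""
    scores = [expected_information_gain(x, X_data, optimal_traj_samples)
              for x in candidates]
    return candidates[int(np.argmax(scores))]
```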
Related papers
- Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
- Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF).
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
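To make the mechanism above concrete, here is a generic, hedged sketch of training a conditional denoising-diffusion model on future states given the current state; the noise schedule, the tiny linear noise predictor, and the conditioning-by-concatenation scheme are illustrative assumptions, not the DVF implementation.

```python
# Generic sketch (not the DVF authors' code) of a conditional diffusion model
# over future states: standard DDPM-style denoising loss, with the current
# state concatenated as conditioning. Shapes and schedule are assumptions.
import torch
import torch.nn.functional as F

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative noise schedule

def diffusion_loss(eps_model, state, future_state):
    """Predict the noise added to a future state, conditioned on the current state."""
    batch = state.shape[0]
    t = torch.randint(0, T, (batch,))
    eps = torch.randn_like(future_state)
    a_bar = alpha_bar[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * future_state + (1.0 - a_bar).sqrt() * eps
    model_in = torch.cat([noisy, state, t.float().unsqueeze(-1) / T], dim=-1)
    return F.mse_loss(eps_model(model_in), eps)

# Toy usage: 4-dim states; the noise predictor sees [noisy future, state, t].
eps_model = torch.nn.Linear(4 + 4 + 1, 4)
loss = diffusion_loss(eps_model, torch.randn(8, 4), torch.randn(8, 4))
```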
arXiv Detail & Related papers (2023-06-09T18:40:55Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Instead of aligning this imagined state with a real state returned by the environment, VCR applies a $Q$-value head on both states and obtains two distributions of action values.
It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.
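A hedged sketch of this value-consistency idea (not the authors' code) follows: the same $Q$-value head is applied to an imagined latent state and to the encoding of the real next state, and the two induced action-value distributions are pushed together; the network shapes and the KL objective are illustrative assumptions.

```python
# Hedged sketch of a value-consistency objective: align action-value
# distributions computed from an imagined state and the real next state,
# rather than aligning the states themselves. Shapes are assumptions.
import torch
import torch.nn.functional as F

def value_consistency_loss(q_head, imagined_state, real_state, temperature=1.0):
    """KL divergence between action-value distributions from the two states."""
    q_imagined = q_head(imagined_state)        # (batch, num_actions)
    q_real = q_head(real_state).detach()       # treat the real branch as target
    log_p = F.log_softmax(q_imagined / temperature, dim=-1)
    target = F.softmax(q_real / temperature, dim=-1)
    return F.kl_div(log_p, target, reduction="batchmean")

# Toy usage: a linear Q-head over 32-dim latent states with 4 actions.
q_head = torch.nn.Linear(32, 4)
loss = value_consistency_loss(q_head, torch.randn(8, 32), torch.randn(8, 32))
```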
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
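As a point of reference for this preference-based setting, the sketch below shows the common Bradley-Terry way of scoring trajectory preferences with a learned latent reward; it is a generic illustration, not necessarily the exact model or the optimistic algorithm of the paper.

```python
# Generic PbRL ingredient (not necessarily this paper's exact model): a
# Bradley-Terry / logistic link from latent trajectory returns to the
# probability that the overseer prefers one trajectory over another.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, traj_a, traj_b, label):
    """Negative log-likelihood of the observed preference (label=1 means the
    overseer preferred traj_a); each trajectory is (steps, feature_dim)."""
    return_a = reward_model(traj_a).sum()      # latent return of trajectory A
    return_b = reward_model(traj_b).sum()      # latent return of trajectory B
    logit = (return_a - return_b).unsqueeze(0) # Bradley-Terry preference logit
    return F.binary_cross_entropy_with_logits(logit, torch.tensor([float(label)]))

# Toy usage: a linear latent reward on 6-dim state-action features.
reward_model = torch.nn.Linear(6, 1)
loss = preference_loss(reward_model, torch.randn(10, 6), torch.randn(12, 6), label=1)
loss.backward()
```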
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Offline Inverse Reinforcement Learning [24.316047317028147]
Offline RL aims to learn optimal policies when a fixed data-set of exploratory demonstrations is available.
Inspired by the success of IRL techniques in achieving state-of-the-art imitation performance in online settings, we exploit GAN-based data augmentation procedures to construct the first offline IRL algorithm.
arXiv Detail & Related papers (2021-06-09T13:44:06Z)
- On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning Problems in High-dimension [7.200655637873445]
Hamiltonian Monte Carlo (HMC) sampling offers a tractable way to generate data for training RL algorithms.
We introduce a framework, called Hamiltonian $Q$-Learning, that demonstrates, both theoretically and empirically, that $Q$-values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions.
arXiv Detail & Related papers (2020-11-11T17:35:25Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- State Action Separable Reinforcement Learning [11.04892417160547]
We propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency.
Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.
arXiv Detail & Related papers (2020-06-05T22:02:57Z)