Model-based Reinforcement Learning for Continuous Control with Posterior
Sampling
- URL: http://arxiv.org/abs/2012.09613v2
- Date: Tue, 16 Nov 2021 20:29:49 GMT
- Title: Model-based Reinforcement Learning for Continuous Control with Posterior
Sampling
- Authors: Ying Fan, Yifei Ming
- Abstract summary: We study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces.
We present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection.
- Score: 10.91557009257615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Balancing exploration and exploitation is crucial in reinforcement learning
(RL). In this paper, we study model-based posterior sampling for reinforcement
learning (PSRL) in continuous state-action spaces theoretically and
empirically. First, to the best of our knowledge, we establish the first regret
bound for PSRL in continuous spaces that is polynomial in the episode length. With
the assumption that reward and transition functions can be modeled by Bayesian
linear regression, we develop a regret bound of $\tilde{O}(H^{3/2}d\sqrt{T})$,
where $H$ is the episode length, $d$ is the dimension of the state-action
space, and $T$ indicates the total time steps. This result matches the
best-known regret bound of non-PSRL methods in linear MDPs. Our bound also
extends to nonlinear cases via feature embedding: using linear
kernels on the feature representation $\phi$, the regret bound becomes
$\tilde{O}(H^{3/2}d_{\phi}\sqrt{T})$, where $d_\phi$ is the dimension of the
representation space. Moreover, we present MPC-PSRL, a model-based posterior
sampling algorithm with model predictive control for action selection. To
capture the uncertainty in models, we use Bayesian linear regression on the
penultimate layer (the feature representation layer $\phi$) of neural networks.
Empirical results show that our algorithm achieves state-of-the-art sample
efficiency on benchmark continuous control tasks compared with prior model-based
algorithms, and matches the asymptotic performance of model-free algorithms.
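The core of MPC-PSRL combines (i) Bayesian linear regression over a feature representation $\phi(s,a)$ for the reward and transition functions, (ii) one posterior sample of the model parameters per episode, and (iii) model predictive control under the sampled model. Below is a minimal, illustrative sketch of this idea, not the authors' implementation: the fixed quadratic feature map, the random-shooting planner, and all hyperparameters are assumptions for illustration (the paper instead uses the penultimate layer of a learned neural network as $\phi$).

```python
import numpy as np

# Minimal sketch of the posterior-sampling idea behind MPC-PSRL (not the
# authors' code): Bayesian linear regression on a feature map phi(s, a),
# one posterior sample of the model per episode, and random-shooting MPC.
# The feature map, planner, and hyperparameters are illustrative stand-ins.

class BayesianLinearModel:
    """Bayesian linear regression  y ~ N(W phi, sigma^2 I)  with a Gaussian prior on W."""

    def __init__(self, feat_dim, out_dim, sigma2=0.1, prior_var=1.0):
        self.sigma2 = sigma2
        self.precision = np.eye(feat_dim) / prior_var   # posterior precision: I/tau^2 + Phi^T Phi / sigma^2
        self.xty = np.zeros((feat_dim, out_dim))        # accumulated Phi^T Y / sigma^2
        self.out_dim = out_dim

    def update(self, feat, target):
        # Rank-1 posterior update with one observed (phi, target) pair.
        self.precision += np.outer(feat, feat) / self.sigma2
        self.xty += np.outer(feat, target) / self.sigma2

    def sample(self):
        # Draw one weight matrix W from the Gaussian posterior (independent columns).
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.xty
        chol = np.linalg.cholesky(cov + 1e-8 * np.eye(cov.shape[0]))
        return (mean + chol @ np.random.randn(cov.shape[0], self.out_dim)).T


def phi(state, action):
    # Fixed quadratic feature map (stand-in for the learned penultimate layer).
    sa = np.concatenate([state, action])
    return np.concatenate([sa, sa ** 2, [1.0]])


def mpc_action(state, W_dyn, W_rew, action_dim, horizon=10, n_candidates=200):
    # Random-shooting MPC under the *sampled* model: evaluate random action
    # sequences and return the first action of the best one.
    best_ret, best_first = -np.inf, np.zeros(action_dim)
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = state.copy(), 0.0
        for a in actions:
            f = phi(s, a)
            ret += (W_rew @ f).item()   # predicted reward under the sampled model
            s = W_dyn @ f               # predicted next state under the sampled model
        if ret > best_ret:
            best_ret, best_first = ret, actions[0]
    return best_first

# Usage sketch (per episode): W_dyn = dyn_model.sample(); W_rew = rew_model.sample();
# act with mpc_action(...), and after each step feed the observed transition and
# reward back through dyn_model.update(...) and rew_model.update(...).
```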
Related papers
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best-policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference.
The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
- Prior-dependent analysis of posterior sampling reinforcement learning with function approximation [19.505117288012148]
This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs.
We establish the first prior-dependent Bayesian regret bound for RL with function approximation and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL).
We present an upper bound of $\mathcal{O}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions.
arXiv Detail & Related papers (2024-03-17T11:23:51Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution by using Langevin Monte Carlo (see the illustrative sketch after this list).
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation [10.159501412046508]
We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDPs).
We establish a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model.
To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation with provable guarantees.
arXiv Detail & Related papers (2022-12-27T16:25:09Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that no such separation between reward-free and reward-aware RL exists in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Exponential Family Model-Based Reinforcement Learning via Score Matching [97.31477125728844]
We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL).
SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression.
arXiv Detail & Related papers (2021-12-28T15:51:07Z)
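As a companion to the Langevin Monte Carlo entry above, here is a minimal, illustrative sketch of sampling Q-function weights from their posterior with Langevin dynamics. The linear Q parameterization, Gaussian likelihood and prior, step size, and synthetic data are assumptions for illustration only, not the cited paper's algorithm.

```python
import numpy as np

# Illustrative Langevin Monte Carlo (LMC) sketch: draw an approximate sample of
# Q-function weights from their posterior, then act greedily w.r.t. that sample
# (Thompson-sampling-style exploration). Everything below is a stand-in.

def grad_log_posterior(theta, features, targets, noise_var=1.0, prior_var=10.0):
    # log p(theta | D) is proportional to
    #   -||y - X theta||^2 / (2 sigma^2) - ||theta||^2 / (2 tau^2)
    residual = targets - features @ theta
    return features.T @ residual / noise_var - theta / prior_var


def langevin_step(theta, grad_log_post, step_size, rng):
    # One LMC step: theta <- theta + (eta/2) grad log p(theta|D) + sqrt(eta) N(0, I)
    noise = rng.standard_normal(theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + np.sqrt(step_size) * noise


# Usage sketch with synthetic regression targets standing in for TD targets.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))                                  # state-action features
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(256)    # stand-in TD targets
theta = np.zeros(8)
for _ in range(500):
    theta = langevin_step(theta, lambda t: grad_log_posterior(t, X, y), 1e-3, rng)
# `theta` is now (approximately) one posterior sample of the Q-weights.
```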
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.