Bayes-Adaptive Deep Model-Based Policy Optimisation
- URL: http://arxiv.org/abs/2010.15948v3
- Date: Mon, 4 Jan 2021 21:03:42 GMT
- Title: Bayes-Adaptive Deep Model-Based Policy Optimisation
- Authors: Tai Hoang and Ngo Anh Vien
- Abstract summary: We introduce a Bayesian (deep) model-based reinforcement learning method (RoMBRL) that can capture model uncertainty to achieve sample-efficient policy optimisation.
We propose to formulate the model-based policy optimisation problem as a Bayes-adaptive Markov decision process (BAMDP).
We show that RoMBRL outperforms existing approaches on many challenging control benchmark tasks in terms of sample complexity and task performance.
- Score: 4.675381958034012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a Bayesian (deep) model-based reinforcement learning method
(RoMBRL) that can capture model uncertainty to achieve sample-efficient policy
optimisation. We propose to formulate the model-based policy optimisation
problem as a Bayes-adaptive Markov decision process (BAMDP). RoMBRL maintains
model uncertainty via belief distributions through a deep Bayesian neural
network whose samples are generated via stochastic gradient Hamiltonian Monte
Carlo. Uncertainty is propagated through simulations controlled by sampled
models and history-based policies. As beliefs are encoded in the visited histories,
we propose a history-based policy network that can be trained end-to-end to
generalise across the history space, and we train it with recurrent
Trust-Region Policy Optimisation. We show that RoMBRL outperforms existing
approaches on many challenging control benchmark tasks in terms of sample
complexity and task performance. The source code of this paper is also publicly
available at https://github.com/thobotics/RoMBRL.
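As an illustration of the loop described in the abstract, the following is a minimal, self-contained sketch, not the authors' implementation (their code is at https://github.com/thobotics/RoMBRL): dynamics-model samples are drawn with stochastic gradient Hamiltonian Monte Carlo, and uncertainty is propagated by rolling out a history-based policy through each sampled model. For brevity the sketch assumes a linear dynamics model in place of the deep Bayesian network and a simple history-averaging policy in place of the recurrent network; names such as `sghmc_sample` and `history_policy` are hypothetical, and the recurrent TRPO update is omitted.

```python
import numpy as np

# Sketch of the RoMBRL loop (hypothetical, simplified):
#   1) draw dynamics-model samples from an approximate posterior via SGHMC,
#   2) roll out a history-conditioned policy through each sampled model,
#   3) the resulting rollouts would feed a recurrent TRPO policy update (omitted).

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 1

# Placeholder buffer of real transitions (s, a, s').
S = rng.normal(size=(256, state_dim))
A = rng.normal(size=(256, action_dim))
S_next = (S + 0.1 * A @ rng.normal(size=(action_dim, state_dim))
            + 0.01 * rng.normal(size=(256, state_dim)))
X = np.hstack([S, A])                                    # model input: (s, a)
W0 = np.zeros((state_dim + action_dim, state_dim))       # linear model stands in for the deep BNN

def sghmc_sample(W_init, X, Y, steps=500, lr=1e-4, alpha=0.1, prior_prec=1e-2, batch=64):
    """Draw one posterior sample of the dynamics model via stochastic gradient HMC."""
    W, r = W_init.copy(), np.zeros_like(W_init)          # parameters and momentum
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch)
        # Minibatch estimate of the gradient of the negative log-posterior.
        grad = (len(X) / batch) * X[idx].T @ (X[idx] @ W - Y[idx]) + prior_prec * W
        noise = rng.normal(scale=np.sqrt(2 * alpha * lr), size=W.shape)
        r = (1 - alpha) * r - lr * grad + noise           # friction plus injected noise
        W = W + r
    return W

def history_policy(history, theta):
    """Toy history-based policy: acts on a running summary of the visited history."""
    summary = np.tanh(np.mean(history, axis=0))           # stand-in for a recurrent encoder
    return np.tanh(summary @ theta)

theta = rng.normal(scale=0.1, size=(state_dim, action_dim))
models = [sghmc_sample(W0, X, S_next) for _ in range(5)]  # belief over dynamics = posterior samples

# Uncertainty propagation: simulate each sampled model under the same history-based policy.
for W_k in models:
    s, history = np.zeros(state_dim), [np.zeros(state_dim)]
    for _ in range(10):
        a = history_policy(np.array(history), theta)
        s = np.concatenate([s, a]) @ W_k                  # sampled model predicts the next state
        history.append(s)
    print("rollout return proxy:", float(-np.sum(np.square(s))))
```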
Related papers
- Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning [5.663006149337036]
Offline model-based reinforcement learning (MBRL) is a powerful approach for data-driven decision-making and control.
Many different MDPs can behave identically on the offline dataset, so dealing with the uncertainty about the true MDP is challenging.
We introduce a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces.
arXiv Detail & Related papers (2024-10-15T03:36:43Z) - Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - A Bayesian Approach to Robust Inverse Reinforcement Learning [54.24816623644148]
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL).
The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics.
Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed to have a highly accurate model of the environment.
arXiv Detail & Related papers (2023-09-15T17:37:09Z) - Bayesian regularization of empirical MDPs [11.3458118258705]
We take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information.
We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large-scale online shopping store.
arXiv Detail & Related papers (2022-08-03T22:02:50Z) - Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy [13.819070455425075]
We learn a dynamics model that fits the empirical state-action visitation distribution of all historical policies.
We propose a novel dynamics model learning method, named Policy-adapted Dynamics Model Learning (PDML).
Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvements in sample efficiency and higher performance when combined with state-of-the-art model-based RL methods.
arXiv Detail & Related papers (2022-07-25T12:45:58Z) - Learning Robust Controllers Via Probabilistic Model-Based Policy Search [2.886634516775814]
We investigate whether controllers learned in such a way are robust and able to generalize under small perturbations of the environment.
We show that enforcing a lower bound to the likelihood noise in the Gaussian Process dynamics model regularizes the policy updates and yields more robust controllers.
arXiv Detail & Related papers (2021-10-26T11:17:31Z) - MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - Variational Model-based Policy Optimization [34.80171122943031]
Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL.
We propose an objective function that is a variational lower bound of the log-likelihood, used to jointly learn and improve the model and policy.
Our experiments on a number of continuous control tasks show that, despite being more complex, our model-based (E-step) algorithm, called variational model-based policy optimization (VMBPO), is more sample-efficient.
arXiv Detail & Related papers (2020-06-09T18:30:15Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods so that the rewards are artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
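The MOPO entry directly above describes running model-based RL with rewards penalized by the uncertainty of the learned dynamics. The snippet below is a minimal, hypothetical illustration of that idea, not the MOPO implementation: it uses disagreement across a toy linear ensemble as the uncertainty estimate, and the `penalized_reward` helper and penalty weight `lam` are placeholders.

```python
import numpy as np

# Illustrative uncertainty-penalized reward: r_tilde(s, a) = r(s, a) - lam * u(s, a),
# where u(s, a) is an ensemble-disagreement estimate of dynamics uncertainty.

rng = np.random.default_rng(1)
state_dim, action_dim, n_models = 3, 1, 5

# Toy ensemble of linear dynamics models standing in for learned neural models.
ensemble = [rng.normal(scale=0.5, size=(state_dim + action_dim, state_dim))
            for _ in range(n_models)]

def penalized_reward(s, a, reward, lam=1.0):
    """Subtract a dynamics-uncertainty penalty (max std over ensemble predictions)."""
    x = np.concatenate([s, a])
    preds = np.stack([x @ W for W in ensemble])          # (n_models, state_dim) predictions
    uncertainty = preds.std(axis=0).max()                # disagreement as an uncertainty proxy
    return reward - lam * uncertainty

s, a = rng.normal(size=state_dim), rng.normal(size=action_dim)
print(penalized_reward(s, a, reward=1.0))
```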
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.