Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement
Learning with Clairvoyant Experts
- URL: http://arxiv.org/abs/2002.03042v1
- Date: Fri, 7 Feb 2020 23:10:05 GMT
- Title: Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement
Learning with Clairvoyant Experts
- Authors: Gilwoo Lee, Brian Hou, Sanjiban Choudhury, Siddhartha S. Srinivasa
- Abstract summary: We formulate robust decision making under uncertainty as Bayesian Reinforcement Learning over latent Markov Decision Processes (MDPs).
We first obtain an ensemble of experts, one for each latent MDP, and fuse their advice to compute a baseline policy.
Next, we train a Bayesian residual policy to improve upon the ensemble's recommendation and learn to reduce uncertainty.
BRPO significantly improves the ensemble of experts and drastically outperforms existing adaptive RL methods.
- Score: 22.87432549580184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Informed and robust decision making in the face of uncertainty is critical
for robots that perform physical tasks alongside people. We formulate this as
Bayesian Reinforcement Learning over latent Markov Decision Processes (MDPs).
While Bayes-optimality is theoretically the gold standard, existing algorithms
do not scale well to continuous state and action spaces. Our proposal builds on
the following insight: in the absence of uncertainty, each latent MDP is easier
to solve. We first obtain an ensemble of experts, one for each latent MDP, and
fuse their advice to compute a baseline policy. Next, we train a Bayesian
residual policy to improve upon the ensemble's recommendation and learn to
reduce uncertainty. Our algorithm, Bayesian Residual Policy Optimization
(BRPO), imports the scalability of policy gradient methods and task-specific
expert skills. BRPO significantly improves the ensemble of experts and
drastically outperforms existing adaptive RL methods.
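The two-stage structure described in the abstract (fuse per-MDP expert advice into a baseline, then add a residual correction) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the discrete belief over latent MDPs, the belief-weighted average used as the fusion rule, the scalar state and action, and the names `update_belief`, `fuse_experts`, and `brpo_action`. In BRPO the residual would be a policy trained with policy gradients; the sketch only accepts it as a callable.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a scalar state and action keep the sketch short.
Expert = Callable[[float], float]
Residual = Callable[[float, Sequence[float]], float]


def update_belief(belief: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """Bayes rule over a discrete set of latent MDPs: posterior ∝ prior × likelihood."""
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        # The observation is impossible under every hypothesis; keep the prior.
        return list(belief)
    return [u / total for u in unnormalized]


def fuse_experts(experts: Sequence[Expert], belief: Sequence[float], state: float) -> float:
    """Baseline action: belief-weighted average of expert actions (assumed fusion rule)."""
    return sum(b * expert(state) for b, expert in zip(belief, experts))


def brpo_action(
    experts: Sequence[Expert],
    residual: Residual,
    belief: Sequence[float],
    state: float,
) -> float:
    """Final action = fused expert baseline + residual correction.

    The residual sees both the state and the belief, so it can in principle
    learn corrections that depend on how uncertain the agent still is.
    """
    return fuse_experts(experts, belief, state) + residual(state, belief)


def make_goal_expert(goal: float) -> Expert:
    """Toy expert for a latent MDP whose goal is known: step straight toward it."""
    return lambda state: goal - state
```

For example, with two toy experts built by `make_goal_expert(-1.0)` and `make_goal_expert(1.0)`, a uniform prior, and an observation that is nine times more likely under the first latent MDP, `update_belief` shifts the belief toward the first expert and `fuse_experts` moves the baseline action toward that expert's recommendation.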
Related papers
- Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes [7.028778922533688]
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty.
We study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning.
arXiv Detail & Related papers (2024-10-14T14:52:23Z)
- Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPs [3.1139806580181006]
We address the challenge of quantifying Bayesian uncertainty in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics.
We use standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters.
We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance.
We highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem.
arXiv Detail & Related papers (2024-06-04T16:21:14Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Human-Algorithm Collaborative Bayesian Optimization for Engineering Systems [0.0]
We re-introduce the human back into the data-driven decision making loop by outlining an approach for collaborative Bayesian optimization.
Our methodology exploits the hypothesis that humans are more efficient at making discrete choices than continuous ones.
We demonstrate our approach across a number of applied and numerical case studies including bioprocess optimization and reactor geometry design.
arXiv Detail & Related papers (2024-04-16T23:17:04Z)
- Risk-Sensitive RL with Optimized Certainty Equivalents via Reduction to Standard RL [48.1726560631463]
We study Risk-Sensitive Reinforcement Learning with the Optimized Certainty Equivalent (OCE) risk.
We propose two general meta-algorithms via reductions to standard RL.
We show that it learns the optimal risk-sensitive policy while prior algorithms provably fail.
arXiv Detail & Related papers (2024-03-10T21:45:12Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Risk-Averse Bayes-Adaptive Reinforcement Learning [3.5289688061934963]
We pose the problem of optimising the conditional value at risk (CVaR) of the total return in Bayes-adaptive Markov decision processes (MDPs).
We show that a policy optimising CVaR in this setting is risk-averse to both the parametric uncertainty due to the prior distribution over MDPs, and the internal uncertainty due to the inherent stochasticity of MDPs.
Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem.
arXiv Detail & Related papers (2021-02-10T22:34:33Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important to apply RL algorithms to many high stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)