Probabilistic Inference in Reinforcement Learning Done Right
- URL: http://arxiv.org/abs/2311.13294v1
- Date: Wed, 22 Nov 2023 10:23:14 GMT
- Title: Probabilistic Inference in Reinforcement Learning Done Right
- Authors: Jean Tarbouriech, Tor Lattimore, Brendan O'Donoghue
- Abstract summary: A popular perspective in reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP).
Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference.
We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret.
- Score: 37.31057328219418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A popular perspective in reinforcement learning (RL) casts the problem as
probabilistic inference on a graphical model of the Markov decision process
(MDP). The core object of study is the probability of each state-action pair
being visited under the optimal policy. Previous approaches to approximate this
quantity can be arbitrarily poor, leading to algorithms that do not implement
genuine statistical inference and consequently do not perform well in
challenging problems. In this work, we undertake a rigorous Bayesian treatment
of the posterior probability of state-action optimality and clarify how it
flows through the MDP. We first reveal that this quantity can indeed be used to
generate a policy that explores efficiently, as measured by regret.
Unfortunately, computing it is intractable, so we derive a new variational
Bayesian approximation yielding a tractable convex optimization problem and
establish that the resulting policy also explores efficiently. We call our
approach VAPOR and show that it has strong connections to Thompson sampling,
K-learning, and maximum entropy exploration. We conclude with some experiments
demonstrating the performance advantage of a deep RL version of VAPOR.
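As a rough illustration of the kind of tractable convex program the abstract alludes to, the hedged sketch below solves an entropy-regularized occupancy-measure problem for a small synthetic tabular MDP with cvxpy. The MDP sizes, uniform initial distribution, and temperature tau are illustrative assumptions; the actual VAPOR objective, which incorporates posterior uncertainty over rewards and transitions, is given in the paper.

```python
# Hedged sketch (assumed setup): an entropy-regularized occupancy-measure program
# for a small synthetic tabular MDP, illustrating the general structure of a
# tractable convex formulation. Not the exact VAPOR objective.
import numpy as np
import cvxpy as cp  # needs an exponential-cone solver (SCS/ECOS/Clarabel, bundled with cvxpy)

S, A, gamma, tau = 4, 2, 0.9, 0.1           # assumed MDP sizes, discount, temperature
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))                # stand-in mean rewards
rho = np.full(S, 1.0 / S)                   # uniform initial state distribution

lam = cp.Variable((S, A), nonneg=True)      # discounted state-action occupancy measure

# Flow constraints: sum_a lam[s',a] = (1-gamma) rho[s'] + gamma sum_{s,a} P(s'|s,a) lam[s,a]
constraints = [
    cp.sum(lam[s2, :]) ==
    (1 - gamma) * rho[s2] + gamma * cp.sum(cp.multiply(P[:, :, s2], lam))
    for s2 in range(S)
]

# Concave objective: expected reward plus an entropy bonus on the occupancy.
objective = cp.Maximize(cp.sum(cp.multiply(r, lam)) + tau * cp.sum(cp.entr(lam)))
cp.Problem(objective, constraints).solve()

policy = lam.value / lam.value.sum(axis=1, keepdims=True)  # normalize occupancy into a policy
print(np.round(policy, 3))
```

The entropy term keeps the occupancy strictly positive, which is the mechanism by which maximum-entropy-style formulations (one of the connections the abstract mentions) induce exploration.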
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPs [3.1139806580181006]
We address the challenge of quantifying Bayesian uncertainty in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics.
We use standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters.
We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance.
We highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem.
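The law-of-total-variance decomposition described above can be sketched as follows; the Bernoulli-reward toy model, the Beta posterior, and the helper name are illustrative assumptions, not the paper's finite-state MDP setup.

```python
# Hedged sketch: aleatoric/epistemic decomposition of return variance via the
# law of total variance over posterior samples (toy model, not the paper's setup).
import numpy as np

rng = np.random.default_rng(1)

def return_moments(theta, n_rollouts=2000, horizon=30, gamma=0.99):
    """First two moments of the discounted return under one posterior sample `theta`
    (here just the success probability of a Bernoulli per-step reward)."""
    discounts = gamma ** np.arange(horizon)
    rewards = rng.binomial(1, theta, size=(n_rollouts, horizon))
    returns = rewards @ discounts
    return returns.mean(), returns.var()

posterior_samples = rng.beta(2.0, 5.0, size=100)   # toy posterior over theta
means, variances = map(np.array, zip(*(return_moments(t) for t in posterior_samples)))

aleatoric = variances.mean()    # E_theta[ Var(G | theta) ]
epistemic = means.var()         # Var_theta( E[G | theta] )
total = aleatoric + epistemic   # law of total variance: Var(G)
print(f"aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}  total={total:.3f}")
```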
arXiv Detail & Related papers (2024-06-04T16:21:14Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
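A minimal fixed-point sketch of a generic uncertainty Bellman equation is given below, assuming a tabular MDP with a known transition kernel; the function name and toy inputs are illustrative, and the paper's UBE (whose solution converges to the true posterior variance over values) differs in how the local uncertainty term is constructed.

```python
# Hedged sketch: fixed-point iteration on a generic uncertainty Bellman equation
#   U(s,a) = u(s,a) + gamma^2 * sum_{s'} P(s'|s,a) * sum_{a'} pi(a'|s') * U(s',a')
# The local uncertainty u(s,a) here is arbitrary; in the paper it is built from
# posterior statistics so that the solution matches the posterior variance of values.
import numpy as np

def solve_ube(P, pi, local_uncertainty, gamma=0.9, n_iter=500):
    S, A = local_uncertainty.shape
    U = np.zeros((S, A))
    for _ in range(n_iter):
        V = (pi * U).sum(axis=1)                     # state-level uncertainty under pi
        U = local_uncertainty + gamma ** 2 * P @ V   # P has shape (S, A, S)
    return U

rng = np.random.default_rng(2)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transition kernel
pi = np.full((S, A), 1.0 / A)                # uniform evaluation policy
u = rng.uniform(size=(S, A))                 # placeholder local uncertainty
print(np.round(solve_ube(P, pi, u), 2))
```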
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Online POMDP Planning with Anytime Deterministic Guarantees [11.157761902108692]
Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs).
Finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks.
We derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one.
arXiv Detail & Related papers (2023-10-03T04:40:38Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
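A hedged sketch of one variance-weighted regression step in the spirit of the description above is shown below; the per-sample variances, ridge term, and toy data are assumptions, and the paper's VA-OPE estimator constructs its weights differently.

```python
# Hedged sketch: one variance-weighted Fitted Q-Iteration step with linear features.
import numpy as np

def weighted_fqi_step(phi, r, phi_next, w_prev, sigma2, gamma=0.99, reg=1e-3):
    """Regress targets y_i = r_i + gamma * phi_next_i @ w_prev onto features phi,
    weighting each sample by 1/sigma2_i so low-variance transitions count more."""
    y = r + gamma * phi_next @ w_prev
    W = 1.0 / sigma2
    A = (phi * W[:, None]).T @ phi + reg * np.eye(phi.shape[1])
    b = (phi * W[:, None]).T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(5)
n, d = 200, 4
phi, phi_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
r, sigma2 = rng.normal(size=n), rng.uniform(0.5, 2.0, size=n)
print(np.round(weighted_fqi_step(phi, r, phi_next, np.zeros(d), sigma2), 3))
```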
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy through interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
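A toy sketch of local policy search guided by a probabilistic (Gaussian-process) model of the objective is shown below, assuming a noisy quadratic stand-in for the return and scikit-learn's GaussianProcessRegressor; the perturbation scale, kernel, and finite-difference gradient of the GP mean are illustrative choices, not the paper's method (which also models the gradient probabilistically).

```python
# Hedged sketch: local policy search guided by a GP model of a noisy objective.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(4)

def noisy_return(theta):
    return -np.sum((theta - 1.0) ** 2) + 0.1 * rng.normal()  # maximum at theta = [1, 1]

theta = np.zeros(2)                                  # current policy parameters
for _ in range(20):
    X = theta + 0.3 * rng.normal(size=(16, 2))       # random local perturbations
    y = np.array([noisy_return(x) for x in X])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=0.01).fit(X, y)
    eps = 1e-3                                       # finite-difference step on the GP mean
    grad = np.array([(gp.predict([theta + eps * e]) - gp.predict([theta - eps * e]))[0]
                     for e in np.eye(2)]) / (2 * eps)
    theta = theta + 0.1 * grad                       # ascend the modeled objective
print("estimated optimum:", np.round(theta, 2))      # should move towards [1, 1]
```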
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
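For context, the sketch below shows the standard tabular analogue: policy evaluation Q = r + gamma * P_pi Q is a linear system solvable in closed form; the paper's nonparametric equation replaces the tabular model with kernel-based estimates but is likewise solved directly. The helper name and toy shapes are assumptions.

```python
# Hedged sketch: closed-form solve of the tabular Bellman evaluation equation.
import numpy as np

def policy_evaluation_closed_form(P, r, pi, gamma=0.95):
    S, A = r.shape
    # P_pi[(s,a), (s',a')] = P(s' | s, a) * pi(a' | s'), flattened over state-action pairs
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    return q.reshape(S, A)

rng = np.random.default_rng(6)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transition kernel
r = rng.uniform(size=(S, A))                 # toy rewards
pi = np.full((S, A), 1.0 / A)                # uniform policy to evaluate
print(np.round(policy_evaluation_closed_form(P, r, pi), 2))
```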
arXiv Detail & Related papers (2020-10-27T13:40:06Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Beyond the Mean-Field: Structured Deep Gaussian Processes Improve the Predictive Uncertainties [12.068153197381575]
We propose a novel variational family that allows for retaining covariances between latent processes while achieving fast convergence.
We provide an efficient implementation of our new approach and apply it to several benchmark datasets.
It yields excellent results and strikes a better balance between accuracy and calibrated uncertainty estimates than its state-of-the-art alternatives.
arXiv Detail & Related papers (2020-05-22T11:10:59Z) - Deep Reinforcement Learning with Weighted Q-Learning [43.823659028488876]
Reinforcement learning algorithms based on Q-learning are driving Deep Reinforcement Learning (DRL) research towards solving complex problems.
Q-Learning is known to be positively biased since it learns by using the maximum over noisy estimates of expected values.
We show how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provides empirical evidence of its advantages on representative benchmarks.
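The overestimation issue mentioned above can be seen in a small Monte Carlo experiment like the hedged sketch below, which compares a plain max over noisy value estimates with a weighted estimate (each action weighted by an approximate probability of being the best); the Gaussian noise model, sample sizes, and weighting scheme are illustrative assumptions, not the paper's Deep Weighted Q-Learning algorithm.

```python
# Hedged sketch: positive bias of max over noisy value estimates vs. a weighted estimate.
import numpy as np

rng = np.random.default_rng(3)
true_values = np.zeros(3)            # all actions have true value 0
n_trials, est_std = 10_000, 0.3      # std of each simulated sample-mean value estimate

max_est, weighted_est = [], []
for _ in range(n_trials):
    means = rng.normal(true_values, est_std)    # noisy value estimates
    max_est.append(means.max())                 # plain max: positively biased
    # Weight each estimate by an approximate probability that its action is the best,
    # estimated by resampling from the Gaussian model of the estimates.
    draws = rng.normal(means, est_std, size=(500, means.size))
    w = np.bincount(draws.argmax(axis=1), minlength=means.size) / 500
    weighted_est.append(w @ means)

print(f"true max = 0.000  E[max estimate] = {np.mean(max_est):.3f}  "
      f"E[weighted estimate] = {np.mean(weighted_est):.3f}")
```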
arXiv Detail & Related papers (2020-03-20T13:57:40Z)