A Provably Efficient Model-Free Posterior Sampling Method for Episodic
Reinforcement Learning
- URL: http://arxiv.org/abs/2208.10904v1
- Date: Tue, 23 Aug 2022 12:21:01 GMT
- Title: A Provably Efficient Model-Free Posterior Sampling Method for Episodic
Reinforcement Learning
- Authors: Christoph Dann, Mehryar Mohri, Tong Zhang, Julian Zimmert
- Abstract summary: Existing posterior sampling methods for reinforcement learning are either model-based or lack worst-case theoretical guarantees beyond linear MDPs.
This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees.
- Score: 50.910152564914405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thompson Sampling is one of the most effective methods for contextual bandits
and has been generalized to posterior sampling for certain MDP settings.
However, existing posterior sampling methods for reinforcement learning are
either model-based or lack worst-case theoretical guarantees beyond
linear MDPs. This paper proposes a new model-free formulation of posterior
sampling that applies to more general episodic reinforcement learning problems
with theoretical guarantees. We introduce novel proof techniques to show that
under suitable conditions, the worst-case regret of our posterior sampling
method matches the best known results of optimization-based methods. In the linear MDP setting with dimension d, the regret of our algorithm scales linearly in d, compared to the quadratic dependence of existing posterior sampling-based exploration algorithms.
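For intuition only, since the listing does not spell out the algorithm: below is a minimal sketch of a standard model-free posterior-sampling baseline for linear MDPs (randomized least-squares value iteration), not the paper's method. The environment interface (env.reset, env.step), the feature map phi, the horizon H, and the action set actions are all hypothetical placeholders.

```python
import numpy as np

def posterior_sample_weights(Phi, y, noise_var=1.0, prior_var=1.0):
    # Gaussian posterior of Bayesian linear regression over Q-weights,
    # followed by a single Thompson-style draw from that posterior.
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ y) / noise_var
    return np.random.multivariate_normal(mean, cov)

def sample_q_functions(data, phi, H, d, actions):
    # data[h]: list of (s, a, r, s_next) transitions observed at step h.
    # Backward pass: regress r + max_b phi(s_next, b) @ w[h+1], then sample w[h].
    w = [np.zeros(d) for _ in range(H + 1)]  # w[H] plays the role of V_{H+1} = 0
    for h in reversed(range(H)):
        if not data[h]:
            continue  # no data yet at this step: keep the zero prior-mean weights
        Phi = np.array([phi(s, a) for (s, a, _, _) in data[h]])
        y = np.array([r + max(phi(s2, b) @ w[h + 1] for b in actions)
                      for (_, _, r, s2) in data[h]])
        w[h] = posterior_sample_weights(Phi, y)
    return w

def run_episode(env, phi, w, H, actions):
    # Act greedily with respect to the sampled Q-functions for one episode.
    s = env.reset()
    episode = []
    for h in range(H):
        a = max(actions, key=lambda b: phi(s, b) @ w[h])
        s_next, r = env.step(a)
        episode.append((s, a, r, s_next))
        s = s_next
    return episode
```

Each episode alternates sample_q_functions and run_episode; the randomness of the posterior draw drives exploration in place of explicit bonus terms.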
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
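The "simple objective" described above combines a preference optimization loss with a supervised learning loss; as an illustrative DPO-style stand-in (the paper's exact loss may differ, and beta and lam are hypothetical weights):

```python
import torch
import torch.nn.functional as F

def combined_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected,
                  sft_nll, beta=0.1, lam=1.0):
    # Inputs: per-example sequence log-probabilities under the policy and a
    # frozen reference model, plus the SFT negative log-likelihood per example.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(logits).mean()  # DPO-style preference term
    return pref_loss + lam * sft_nll.mean()   # plus the supervised (SFT) term
```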
- Improving Diffusion Models for Inverse Problems Using Optimal Posterior Covariance [52.093434664236014]
Recent diffusion models provide a promising zero-shot solution to noisy linear inverse problems without task-specific retraining.
Motivated by this, we propose to improve recent methods by using a more principled covariance determined by maximum likelihood estimation.
arXiv Detail & Related papers (2024-02-03T13:35:39Z)
- On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond [29.449446595110643]
We propose a notion of data diversity that subsumes the previous notions of coverage measures in offline RL.
Our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.
arXiv Detail & Related papers (2024-01-06T20:52:04Z)
- Solving Linear Inverse Problems Provably via Posterior Sampling with Latent Diffusion Models [98.95988351420334]
We present the first framework to solve linear inverse problems leveraging pre-trained latent diffusion models.
We theoretically analyze our algorithm showing provable sample recovery in a linear model setting.
arXiv Detail & Related papers (2023-07-02T17:21:30Z)
- Posterior Sampling for Deep Reinforcement Learning [0.0]
This paper introduces Posterior Sampling for Deep Reinforcement Learning (PSDRL), the first truly scalable approximation of Posterior Sampling for Reinforcement Learning.
Experiments on the Atari benchmark show that PSDRL significantly outperforms previous state-of-the-art attempts at scaling up posterior sampling.
arXiv Detail & Related papers (2023-04-30T13:23:50Z)
- Plug-and-Play split Gibbs sampler: embedding deep generative priors in Bayesian inference [12.91637880428221]
This paper introduces a plug-and-play sampling algorithm that leverages variable splitting to efficiently sample from a posterior distribution.
It divides the challenging task of posterior sampling into two simpler sampling problems.
Its performance is compared to recent state-of-the-art optimization and sampling methods.
arXiv Detail & Related papers (2023-04-21T17:17:51Z)
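To make the "two simpler sampling problems" concrete: a toy split Gibbs sampler for the Gaussian linear model y = Ax + noise, with a Gaussian prior standing in for the paper's deep generative prior (sigma2, rho2, and tau2 are illustrative hyperparameters):

```python
import numpy as np

def split_gibbs(A, y, sigma2=0.1, rho2=0.1, tau2=1.0, n_iter=500):
    # Sample p(x | y) for y = A x + N(0, sigma2 I), prior x ~ N(0, tau2 I),
    # via an augmented joint that couples x to an auxiliary z by N(x; z, rho2 I).
    d = A.shape[1]
    z = np.zeros(d)
    samples = []
    # The precision of p(x | z, y) is constant, so factor it once.
    cov_x = np.linalg.inv(A.T @ A / sigma2 + np.eye(d) / rho2)
    chol_x = np.linalg.cholesky(cov_x)
    for _ in range(n_iter):
        # Block 1: x | z, y -- likelihood-plus-coupling conditional (Gaussian).
        mean_x = cov_x @ (A.T @ y / sigma2 + z / rho2)
        x = mean_x + chol_x @ np.random.randn(d)
        # Block 2: z | x -- prior-plus-coupling conditional; Gaussian here only
        # because the prior is Gaussian (the paper plugs in a deep generative
        # prior at exactly this step).
        var_z = 1.0 / (1.0 / rho2 + 1.0 / tau2)
        z = var_z * (x / rho2) + np.sqrt(var_z) * np.random.randn(d)
        samples.append(x.copy())
    return np.array(samples)
```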
- Model Agnostic Sample Reweighting for Out-of-Distribution Learning [38.843552982739354]
We propose a principled method, Model Agnostic samPLe rEweighting (MAPLE), to effectively address the OOD problem.
Our key idea is to find an effective reweighting of the training samples so that the standard empirical risk minimization training of a large model leads to superior OOD generalization performance.
arXiv Detail & Related papers (2023-01-24T05:11:03Z)
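As a toy illustration of the sample-reweighting idea (not MAPLE's actual procedure, which learns the weights with gradients): learn nonnegative per-sample weights so that weighted ERM lowers the loss on a held-out set standing in for the OOD distribution; the random-search outer loop here is a crude stand-in.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def reweighted_erm(X, y, w, steps=200, lr=0.1):
    # Weighted logistic-regression ERM: minimize mean_i w_i * loss_i(theta).
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        theta -= lr * X.T @ (w * (p - y)) / len(y)
    return theta

def learn_sample_weights(X_tr, y_tr, X_val, y_val, outer_steps=50, eps=0.5):
    # Crude outer loop: perturb one weight at a time and keep changes that
    # reduce the held-out (proxy OOD) loss of the retrained model.
    def val_loss(w):
        p = sigmoid(X_val @ reweighted_erm(X_tr, y_tr, w))
        return -np.mean(y_val * np.log(p + 1e-9)
                        + (1 - y_val) * np.log(1 - p + 1e-9))
    w = np.ones(len(y_tr))
    best = val_loss(w)
    for _ in range(outer_steps):
        trial = w.copy()
        i = np.random.randint(len(w))
        trial[i] = max(0.0, trial[i] + eps * np.random.randn())
        cand = val_loss(trial)
        if cand < best:
            w, best = trial, cand
    return w
```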
- GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
- Learning the Truth From Only One Side of the Story [58.65439277460011]
We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution.
We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically.
arXiv Detail & Related papers (2020-06-08T18:20:28Z)
- Robust Q-learning [0.0]
We propose a robust Q-learning approach which allows estimating nuisance parameters using data-adaptive techniques.
We study the behavior of our estimators and provide simulation studies that highlight the need for and usefulness of the proposed method.
arXiv Detail & Related papers (2020-03-27T14:10:38Z)