Q-learning with Posterior Sampling
- URL: http://arxiv.org/abs/2506.00917v1
- Date: Sun, 01 Jun 2025 09:11:24 GMT
- Title: Q-learning with Posterior Sampling
- Authors: Priyank Agrawal, Shipra Agrawal, Azmat Azati,
- Abstract summary: We introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration. We show that PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms.
- Score: 3.598052011212994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, S, A denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
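As a reading aid, here is a minimal, hypothetical sketch of the core idea described in the abstract: maintain a Gaussian posterior over each Q-value, sample from it to choose actions (Thompson-sampling style), and update the posterior mean with a standard Q-learning step. The class name, learning-rate schedule, and posterior-variance schedule below are illustrative assumptions, not the paper's exact construction.
```python
# Illustrative sketch only: Gaussian posterior sampling over Q-values in a
# tabular episodic MDP, in the spirit of PSQL as described in the abstract.
import numpy as np


class PSQLSketch:
    def __init__(self, num_states, num_actions, horizon, seed=0):
        self.S, self.A, self.H = num_states, num_actions, horizon
        self.rng = np.random.default_rng(seed)
        # Posterior mean of Q, initialized optimistically at H (the maximum return).
        self.q_mean = np.full((horizon, num_states, num_actions), float(horizon))
        self.counts = np.zeros((horizon, num_states, num_actions), dtype=int)

    def _posterior_std(self, h, s):
        # Posterior uncertainty shrinks with the visit count (illustrative schedule).
        return self.H / np.sqrt(self.counts[h, s] + 1.0)

    def select_action(self, h, s):
        # Thompson-sampling-style step: sample Q-values from their Gaussian
        # posteriors and act greedily on the sampled values.
        sampled_q = self.q_mean[h, s] + self._posterior_std(h, s) * self.rng.standard_normal(self.A)
        return int(np.argmax(sampled_q))

    def update(self, h, s, a, reward, s_next):
        # One TD / Q-learning style update of the posterior mean.
        self.counts[h, s, a] += 1
        n = self.counts[h, s, a]
        alpha = (self.H + 1) / (self.H + n)  # a learning rate common in episodic Q-learning analyses
        next_value = self.q_mean[h + 1, s_next].max() if h + 1 < self.H else 0.0
        self.q_mean[h, s, a] += alpha * (reward + next_value - self.q_mean[h, s, a])
```
A driver loop would call `select_action` at every step of an episode and `update` after each transition; the paper's analysis concerns the exact schedules that yield the $\tilde O(H^2\sqrt{SAT})$ regret bound.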
Related papers
- Provably Efficient and Agile Randomized Q-Learning [35.14581235983678]
We propose a novel variant of the Q-learning algorithm, referred to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise policy updates. Empirically, RandomizedQ exhibits outstanding performance compared to existing Q-learning variants with both bonus-based and Bayesian-based exploration on standard benchmarks.
arXiv Detail & Related papers (2025-06-30T16:08:29Z)
- Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling [46.035795210898414]
We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS). Our work showcases the value of information theory in reinforcement learning and in the training of large language models.
arXiv Detail & Related papers (2025-02-08T03:47:00Z)
- Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling [14.776559457850624]
We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDPs).
The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms.
arXiv Detail & Related papers (2024-05-29T11:59:56Z)
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost. We extend our theory to unveil the influences of problem-dependent quantities such as the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs and $\mathcal{O}(\sqrt{T})$ for RL.
arXiv Detail & Related papers (2023-06-15T01:16:29Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo (a toy sketch of this sampling step appears after this list).
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of states.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
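The Langevin Monte Carlo entry above samples the Q function directly from its posterior; as referenced there, below is a toy, hypothetical sketch of that sampling step for a linear Q-value model with a squared TD-regression loss and Gaussian prior. The function name `langevin_sample_q`, the objective, and the step-size and iteration settings are illustrative assumptions, not the cited paper's algorithm.
```python
# Toy sketch: draw an approximate posterior sample of linear Q-value weights by
# running noisy gradient (Langevin) steps on a negative log-posterior made of a
# squared TD-regression loss plus a Gaussian prior. Illustrative only.
import numpy as np


def langevin_sample_q(features, targets, step_size=1e-3, n_steps=2000, prior_var=10.0, seed=0):
    """Return one approximate posterior sample of the Q-value weight vector."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    theta = np.zeros(dim)
    for _ in range(n_steps):
        residual = features @ theta - targets
        grad = features.T @ residual + theta / prior_var  # gradient of the negative log-posterior
        theta = theta - step_size * grad + np.sqrt(2.0 * step_size) * rng.standard_normal(dim)
    return theta


# Usage: regress a sampled weight vector onto TD targets, then act greedily on the
# sampled Q estimate (Thompson-sampling-style exploration).
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 5))                      # hypothetical state-action features
y = X @ np.ones(5) + 0.1 * rng.standard_normal(64)    # hypothetical TD targets
theta_sample = langevin_sample_q(X, y)
```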
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.