Related papers: Generalized Bayesian deep reinforcement learning

Generalized Bayesian deep reinforcement learning

URL: http://arxiv.org/abs/2412.11743v2
Date: Mon, 02 Jun 2025 11:22:35 GMT
Title: Generalized Bayesian deep reinforcement learning
Authors: Shreya Sinha Roy, Richard G. Everitt, Christian P. Robert, Ritabrata Dutta,
Abstract summary: We propose to model the dynamics of the unknown environment through deep generative models, assuming Markov dependence.<n>In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior.<n>For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximising the expected value function with respect to the posterior distribution.
Score: 2.469908534801392
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. As a model-based RL method, it has two key components: (1) inferring the posterior distribution of the model for the data-generating process (DGP) and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models, assuming Markov dependence. In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We used sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. In conjunction, to achieve scalability in the high-dimensional parameter space of the neural networks, we use the gradient-based Markov kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises-type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximising the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions, which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies, assuming a discrete action space. Finally, we successfully extended our setup for a challenging problem with a continuous action space without theoretical guarantees.

Related papers

Reinforced sequential Monte Carlo for amortised sampling [49.92678178064033]
We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL)<n>We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance.
arXiv Detail & Related papers (2025-10-13T17:59:11Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
von Mises Quasi-Processes for Bayesian Circular Regression [57.88921637944379]
We explore a family of expressive and interpretable distributions over circle-valued random functions. The resulting probability model has connections with continuous spin models in statistical physics. For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Markov Chain Monte Carlo sampling.
arXiv Detail & Related papers (2024-06-19T01:57:21Z)
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems. In common practice, convergence (hyper)policies are learned only to deploy their deterministic version. We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
Learning Merton's Strategies in an Incomplete Market: Recursive Entropy Regularization and Biased Gaussian Exploration [11.774563966512709]
We take the reinforcement learning (RL) approach to learn optimal portfolio policies directly by exploring the unknown market. We present an analysis of the resulting errors to show how the level of exploration affects the learned policies.
arXiv Detail & Related papers (2023-12-19T02:14:13Z)
Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks. We study the problem from a model-based Bayesian reinforcement learning perspective. We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL) We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
Bayesian Neural Network Inference via Implicit Models and the Posterior Predictive Distribution [0.8122270502556371]
We propose a novel approach to perform approximate Bayesian inference in complex models such as Bayesian neural networks. The approach is more scalable to large data than Markov Chain Monte Carlo. We see this being useful in applications such as surrogate and physics-based models.
arXiv Detail & Related papers (2022-09-06T02:43:19Z)
Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z)
What Are Bayesian Neural Network Posteriors Really Like? [63.950151520585024]
We show that Hamiltonian Monte Carlo can achieve significant performance gains over standard and deep ensembles. We also show that deep distributions are similarly close to HMC as standard SGLD, and closer than standard variational inference.
arXiv Detail & Related papers (2021-04-29T15:38:46Z)
Model-based Reinforcement Learning for Continuous Control with Posterior Sampling [10.91557009257615]
We study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces. We present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection.
arXiv Detail & Related papers (2020-11-20T21:00:31Z)
Bayes-Adaptive Deep Model-Based Policy Optimisation [4.675381958034012]
We introduce a Bayesian (deep) model-based reinforcement learning method (RoMBRL) that can capture model uncertainty to achieve sample-efficient policy optimisation. We propose to formulate the model-based policy optimisation problem as a Bayes-adaptive Markov decision process (BAMDP) We show that RoMBRL outperforms existing approaches on many challenging control benchmark tasks in terms of sample complexity and task performance.
arXiv Detail & Related papers (2020-10-29T21:17:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.