EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
- URL: http://arxiv.org/abs/2512.15405v1
- Date: Wed, 17 Dec 2025 12:55:05 GMT
- Title: EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
- Authors: Jianfei Ma, Wee Sun Lee
- Abstract summary: Epistemic uncertainty reflects systematic uncertainty due to limited knowledge. We propose a Bayesian reinforcement learning algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration.
- Score: 22.84927928856004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate $\texttt{EUBRL}$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $\texttt{EUBRL}$ achieves superior sample efficiency, scalability, and consistency.
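The abstract describes the high-level recipe: maintain a Bayesian posterior over the environment and let epistemic uncertainty direct exploration. A minimal tabular sketch of that generic pattern follows, assuming a Dirichlet posterior over transitions and known rewards; the bonus form, `scale`, and the posterior-mean backup are illustrative assumptions, not EUBRL's exact construction.

```python
# Illustrative sketch: epistemic-uncertainty-directed exploration in a tabular MDP.
# NOT the paper's exact algorithm; it shows the generic pattern of keeping a
# Dirichlet posterior over transitions and adding an exploration bonus that
# shrinks as epistemic uncertainty is resolved.
import numpy as np

S, A, gamma = 5, 2, 0.95
rng = np.random.default_rng(0)

alpha = np.ones((S, A, S))            # Dirichlet prior: one pseudo-count everywhere
R = rng.uniform(size=(S, A))          # rewards assumed known, for simplicity
Q = np.zeros((S, A))

def epistemic_bonus(alpha_sa, scale=1.0):
    """High when few transitions have been observed, decays as counts grow."""
    return scale / np.sqrt(alpha_sa.sum())

for step in range(2000):
    s = rng.integers(S)
    # Directed exploration: act greedily w.r.t. Q plus the epistemic bonus.
    bonuses = np.array([epistemic_bonus(alpha[s, a]) for a in range(A)])
    a = int(np.argmax(Q[s] + bonuses))
    # The posterior mean stands in for the unknown true dynamics so the
    # sketch stays self-contained.
    p = alpha[s, a] / alpha[s, a].sum()
    s_next = rng.choice(S, p=p)
    alpha[s, a, s_next] += 1.0        # Bayesian posterior update
    # One-step backup under the updated posterior-mean model.
    p_mean = alpha[s, a] / alpha[s, a].sum()
    Q[s, a] = R[s, a] + gamma * p_mean @ Q.max(axis=1)
```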
Related papers
- Adaptive Uncertainty-Aware Tree Search for Robust Reasoning [30.006968507593452]
Inference-time reasoning has advanced the capabilities of Large Language Models (LLMs) in complex problem-solving. A prevalent approach involves external search guided by Process Reward Models (PRMs). We provide empirical evidence that PRMs exhibit high uncertainty and unreliable scoring on out-of-distribution (OOD) samples. We then establish a theoretical framework proving that while standard search incurs linear regret accumulation, an uncertainty-aware strategy can achieve sublinear regret.
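To make the uncertainty-aware strategy concrete, here is a hedged sketch that penalizes a node's PRM score by its estimated uncertainty inside an otherwise standard UCB-style selection rule; the `Node` fields, `beta`, and the scoring formula are illustrative assumptions, not the paper's exact criterion.

```python
# Hedged sketch of uncertainty-aware node selection for PRM-guided search.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prm_score: float        # process reward model score for this partial solution
    prm_uncertainty: float  # estimated score uncertainty (e.g., ensemble std)
    visits: int = 0
    children: list = field(default_factory=list)

def select(node: Node, total_visits: int, beta: float = 1.0, c: float = 1.4) -> Node:
    """Pick the child maximizing score - beta * uncertainty + exploration bonus."""
    def ucb(child: Node) -> float:
        bonus = c * math.sqrt(math.log(total_visits + 1) / (child.visits + 1))
        return child.prm_score - beta * child.prm_uncertainty + bonus
    return max(node.children, key=ucb)

root = Node(0.0, 0.0, children=[Node(0.9, 0.4, visits=3), Node(0.7, 0.05, visits=2)])
best = select(root, total_visits=5)   # low-uncertainty child wins despite lower score
```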
arXiv Detail & Related papers (2026-02-06T08:40:05Z)
- BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search [72.87861928940929]
Boundary-Aware Policy Optimization (BAPO) is a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut.
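A small sketch of how such a group-based reward with an adaptive modulator might look; the gating rule, `warmup_steps`, and `idk_reward` are assumptions for illustration rather than BAPO's published formulation.

```python
# Hedged sketch of a group-based boundary-aware reward with adaptive gating.
def boundary_aware_reward(correct, idk_flags, step, warmup_steps=1000, idk_reward=0.5):
    """Per-sample rewards for a group of rollouts on the same prompt.

    correct:   list of bools, whether each answer is right
    idk_flags: list of bools, whether the model abstained ("IDK")
    """
    # Adaptive modulator: suspend the IDK reward early in training so the
    # model cannot exploit abstention as a shortcut.
    idk_bonus = idk_reward if step >= warmup_steps else 0.0
    # Group statistic: if any rollout solved the task, it lies within the
    # model's boundary, so abstaining earns nothing.
    group_solved = any(c for c, idk in zip(correct, idk_flags) if not idk)
    rewards = []
    for c, idk in zip(correct, idk_flags):
        if idk:
            rewards.append(0.0 if group_solved else idk_bonus)
        else:
            rewards.append(1.0 if c else 0.0)
    return rewards

print(boundary_aware_reward([True, False, False], [False, False, True], step=2000))
```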
arXiv Detail & Related papers (2026-01-16T07:06:58Z)
- Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation [68.106428321492]
Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. We present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately.
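One common recipe for such a decomposition is to sample several plans, use inter-sample disagreement as the epistemic term, and use mean per-sample token entropy as the intrinsic term; the sketch below assumes that recipe as a stand-in rather than CURE's exact estimators.

```python
# Hedged sketch: splitting plan uncertainty into epistemic and intrinsic parts.
import math
from collections import Counter

def combined_uncertainty(sampled_plans, token_entropies):
    """sampled_plans:   plan strings from repeated LLM sampling
    token_entropies: mean token entropy of each sample (intrinsic signal)"""
    counts = Counter(sampled_plans)
    probs = [c / len(sampled_plans) for c in counts.values()]
    epistemic = -sum(p * math.log(p) for p in probs)   # disagreement entropy
    intrinsic = sum(token_entropies) / len(token_entropies)
    return epistemic, intrinsic

ep, intr = combined_uncertainty(
    ["pick cup; place on shelf", "pick cup; place on shelf", "pick bowl"],
    [0.8, 0.7, 1.5],
)
```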
arXiv Detail & Related papers (2025-10-09T10:26:58Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimation.
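The sketch below illustrates one way an auxiliary loss over token-level $Q$-value estimates could sit on top of the standard SFT objective, reading chosen-token log-probabilities as implicit $Q$-values; the target construction and `aux_weight` are assumptions, not the paper's exact loss.

```python
# Hedged sketch: SFT loss plus an auxiliary token-level Q-value regularizer.
import torch
import torch.nn.functional as F

def soc_style_loss(logits, labels, aux_weight=0.1):
    """logits: (batch, seq, vocab); labels: (batch, seq) expert token ids."""
    ce = F.cross_entropy(logits.transpose(1, 2), labels)   # standard SFT term
    # Read log-probabilities of the expert tokens as implicit Q estimates.
    logp = logits.log_softmax(-1)
    q_expert = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Optimism correction: push expert-token Q toward 0 (probability 1),
    # which damps over-optimistic mass on competing continuations.
    aux = (q_expert ** 2).mean()
    return ce + aux_weight * aux

loss = soc_style_loss(torch.randn(2, 7, 100), torch.randint(0, 100, (2, 7)))
```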
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds [28.4976864705409]
This paper studies constrained Markov decision processes (CMDPs) whose constraints are evaluated against stochastic thresholds, aiming at safe reinforcement learning in unknown and uncertain environments. We leverage a GrowingWindow estimator, sampling from interactions with the uncertain and dynamic environment, to estimate the thresholds, based on which we design Stochastic Pessimistic-Optimistic Thresholding (SPOT). SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings.
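A minimal sketch of estimating an unknown threshold from interaction data with pessimistic and optimistic readouts; the all-samples window and Hoeffding-style radius are illustrative stand-ins for the paper's GrowingWindow construction.

```python
# Hedged sketch: threshold estimation with pessimistic/optimistic readouts.
import math

class ThresholdEstimator:
    """Estimates an unknown constraint threshold from noisy observations."""
    def __init__(self):
        self.samples = []

    def update(self, obs: float):
        self.samples.append(obs)

    def estimate(self, optimistic: bool, delta: float = 0.05) -> float:
        n = len(self.samples)
        mean = sum(self.samples) / n
        # Hoeffding-style radius; assumes observations bounded in [0, 1].
        radius = math.sqrt(math.log(1.0 / delta) / (2 * n))
        # Optimistic: allow more constraint budget; pessimistic: stay safe.
        return mean + radius if optimistic else mean - radius

est = ThresholdEstimator()
for obs in (0.4, 0.5, 0.45):
    est.update(obs)
print(est.estimate(optimistic=False), est.estimate(optimistic=True))
```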
arXiv Detail & Related papers (2025-04-07T11:58:19Z)
- Diffusion Policies for Risk-Averse Behavior Modeling in Offline Reinforcement Learning [26.34178581703107]
Offline reinforcement learning (RL) presents distinct challenges as it relies solely on observational data. We propose an uncertainty-aware distributional offline RL method to simultaneously address both epistemic uncertainty and environmental stochasticity. Our method is rigorously evaluated through comprehensive experiments in both risk-sensitive and risk-neutral benchmarks, demonstrating its superior performance.
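A common way to obtain risk-averse behavior from a distributional critic is to rank actions by the CVaR of their predicted return quantiles; the sketch below shows that generic step only, omitting the paper's diffusion-policy machinery, and `alpha` is an illustrative risk level.

```python
# Hedged sketch: risk-averse action selection via CVaR over return quantiles.
import numpy as np

def cvar(quantiles: np.ndarray, alpha: float = 0.25) -> float:
    """Mean of the worst alpha-fraction of sorted return quantiles."""
    q = np.sort(quantiles)
    k = max(1, int(np.ceil(alpha * len(q))))
    return float(q[:k].mean())

def risk_averse_action(quantiles_per_action: np.ndarray, alpha: float = 0.25) -> int:
    """quantiles_per_action: (num_actions, num_quantiles) critic output."""
    return int(np.argmax([cvar(q, alpha) for q in quantiles_per_action]))

critic_out = np.random.default_rng(0).normal(size=(3, 32))  # toy critic output
print(risk_averse_action(critic_out))
```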
arXiv Detail & Related papers (2024-03-26T12:28:04Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
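In tabular form, a UBE of this kind propagates local epistemic variance through a Bellman-style recursion; the sketch below assumes a known policy and a posterior-mean model, and the `local_var` input is an illustrative stand-in for the paper's local uncertainty term.

```python
# Hedged sketch: tabular uncertainty Bellman recursion for value variance.
import numpy as np

def uncertainty_bellman_iteration(P_mean, local_var, pi, gamma=0.99, iters=500):
    """P_mean:    (S, A, S) posterior-mean transition model
    local_var: (S, A) per-step epistemic variance
    pi:        (S, A) policy probabilities
    Returns U: (S,) estimated posterior variance of the value at each state."""
    U = np.zeros(P_mean.shape[0])
    for _ in range(iters):
        # U(s) = sum_a pi(a|s) [ local_var(s,a) + gamma^2 * sum_s' P(s'|s,a) U(s') ]
        U = (pi * (local_var + gamma**2 * (P_mean @ U))).sum(axis=1)
    return U

S, A = 4, 2
P = np.full((S, A, S), 1.0 / S)                     # toy uniform dynamics
U = uncertainty_bellman_iteration(P, 0.1 * np.ones((S, A)), np.full((S, A), 0.5))
```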
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Latent State Marginalization as a Low-cost Approach for Improving Exploration [79.12247903178934]
We propose the adoption of latent variable policies within the MaxEnt framework.
We show that latent variable policies naturally emerge when using world models with a latent belief state.
We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.
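The marginalization step itself can be sketched directly: estimate log pi(a|s) = log E_z[pi(a|s,z)] by Monte Carlo over latent samples. The toy policy head and sample count below are assumptions for illustration; the paper couples this with world-model belief states and MaxEnt training, which are omitted here.

```python
# Hedged sketch: Monte Carlo marginalization of a latent-variable policy.
import math
import torch

def marginal_log_prob(policy_net, state, action, num_latents=16, latent_dim=8):
    """log pi(a|s) = log E_z[ pi(a|s,z) ], estimated with K latent samples."""
    z = torch.randn(num_latents, latent_dim)             # z ~ p(z)
    s = state.unsqueeze(0).expand(num_latents, -1)       # repeat state per sample
    logits = policy_net(torch.cat([s, z], dim=-1))       # (K, num_actions)
    log_pa = logits.log_softmax(-1)[:, action]           # log pi(a|s, z_k)
    # log-mean-exp over latent samples for a numerically stable estimate.
    return torch.logsumexp(log_pa, dim=0) - math.log(num_latents)

net = torch.nn.Linear(4 + 8, 3)     # toy policy head: state_dim=4, 3 actions
lp = marginal_log_prob(net, torch.zeros(4), action=1)
```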
arXiv Detail & Related papers (2022-10-03T15:09:12Z)
- Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty [5.117030416610515]
Online reinforcement learning (RL) has been widely applied in information processing scenarios.
We consider undiscounted RL in general Markov decision processes (MDPs) with both endogeneous and exogeneous uncertainty.
We establish a regret bound saving at most $\sqrt{S}$ or $S^{\frac{1}{6}}T^{\frac{1}{12}}$ compared with the latest results in the literature.
arXiv Detail & Related papers (2021-06-15T22:57:45Z)
- Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator which does not require $\nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
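The paper's estimator is not reproduced here, but median-of-means is a classic robust estimator with the same flavor: it needs no prior bound $\nu_p$ and its error probability decays exponentially in the number of blocks. A minimal sketch, with the heavy-tailed test distribution chosen for illustration:

```python
# Hedged sketch: median-of-means, a classic robust mean estimator for
# heavy-tailed samples (a stand-in for the paper's estimator).
import numpy as np

def median_of_means(samples: np.ndarray, num_blocks: int) -> float:
    """Split samples into blocks, average each, take the median of averages."""
    blocks = np.array_split(samples, num_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(0)
heavy = rng.standard_t(df=2, size=10_000)     # infinite variance, finite p-th moment
print(median_of_means(heavy, num_blocks=32))  # close to the true mean 0
```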
arXiv Detail & Related papers (2020-10-24T10:44:02Z)
- Temporal Difference Uncertainties as a Signal for Exploration [76.6341354269013]
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy.
In this paper, we highlight that value estimates are easily biased and temporally inconsistent.
We propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors.
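A rough way to see the signal: with several value estimates in hand, the spread of their TD errors at a transition marks where the agent's beliefs disagree. The ensemble below is a simplified stand-in, since the paper induces a distribution over TD errors rather than training independent copies.

```python
# Hedged sketch: TD-error disagreement across an ensemble as an exploration signal.
import numpy as np

def td_uncertainty(q_ensemble, s, a, r, s_next, gamma=0.99):
    """q_ensemble: list of (S, A) Q-tables. Returns std of their TD errors."""
    td_errors = [r + gamma * Q[s_next].max() - Q[s, a] for Q in q_ensemble]
    return float(np.std(td_errors))

rng = np.random.default_rng(0)
ensemble = [rng.normal(size=(5, 2)) for _ in range(8)]
bonus = td_uncertainty(ensemble, s=0, a=1, r=1.0, s_next=3)  # exploration bonus
```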
arXiv Detail & Related papers (2020-10-05T18:11:22Z)