Related papers: Reinforcement Learning under State and Outcome Uncertainty: A Foundational Distributional Perspective

URL: http://arxiv.org/abs/2505.06518v2
Date: Mon, 07 Jul 2025 00:26:07 GMT
Title: Reinforcement Learning under State and Outcome Uncertainty: A Foundational Distributional Perspective
Authors: Larry Preuett III, Qiuyi Zhang, Muhammad Aurangzeb Ahmad
Abstract summary: We extend Distributional Reinforcement Learning (DistRL) to Partially Observable Markov Decision Processes (POMDPs). We introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure, bridging DistRL and POMDP planning.
Score: 4.794801758504495
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In many real-world planning tasks, agents must tackle uncertainty about the environment's state and variability in the outcomes of any chosen policy. We address both forms of uncertainty as a first step toward safer algorithms in partially observable settings. Specifically, we extend Distributional Reinforcement Learning (DistRL), which models the entire return distribution for fully observable domains, to Partially Observable Markov Decision Processes (POMDPs), allowing an agent to learn the distribution of returns for each conditional plan. Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure, bridging DistRL and POMDP planning. By tracking return distributions, DPBVI naturally enables risk-sensitive control in domains where rare, high-impact events must be carefully managed. We provide source code to foster further research in robust decision-making under partial observability.

Related papers

Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
On Policy Evaluation Algorithms in Distributional Reinforcement Learning [0.0]
We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL). For a plain instance of our proposed class of algorithms, we prove error bounds within both Wasserstein and Kolmogorov-Smirnov distances. For return distributions having probability density functions, the algorithms yield approximations for these densities; error bounds are given within supremum norm.
arXiv Detail & Related papers (2024-07-19T10:06:01Z)
Risk-sensitive Markov Decision Process and Learning under General Utility Functions [3.069335774032178]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. We consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z)
Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs). This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models. In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return. We show that this distribution can be approximated by a finite number of random variables. Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z)
GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making. We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
Robustness and risk management via distributional dynamic programming [13.173307471333619]
We introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation. Our approach reformulates through an augmented state space where each state is split into a worst-case substate and a best-case substate. We derive distributional operators and DP algorithms solving a new control task.
arXiv Detail & Related papers (2021-12-28T12:12:57Z)
Bayesian Distributional Policy Gradients [2.28438857884398]
Distributional Reinforcement Learning maintains the entire probability distribution of the reward-to-go, i.e. the return. Bayesian Distributional Policy Gradients (BDPG) uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns.
arXiv Detail & Related papers (2021-03-20T23:42:50Z)
Learning Invariant Representations and Risks for Semi-supervised Domain Adaptation [109.73983088432364]
We propose the first method that aims to simultaneously learn invariant representations and risks under the setting of semi-supervised domain adaptation (Semi-DA). We introduce the LIRR algorithm for jointly Learning Invariant Representations and Risks.
arXiv Detail & Related papers (2020-10-09T15:42:35Z)
Distributional Reinforcement Learning via Moment Matching [54.16108052278444]
We formulate a method that learns a finite set of statistics from each return distribution via neural networks. Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target. Experiments on the suite of Atari games show that our method outperforms the standard distributional RL baselines.
arXiv Detail & Related papers (2020-07-24T05:18:17Z)
Implicit Distributional Reinforcement Learning [61.166030238490634]
The implicit distributional actor-critic (IDAC) consists of a distributional critic built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution. We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.