Provably Sample-Efficient Robust Reinforcement Learning with Average Reward
- URL: http://arxiv.org/abs/2505.12462v2
- Date: Thu, 25 Sep 2025 14:09:15 GMT
- Title: Provably Sample-Efficient Robust Reinforcement Learning with Average Reward
- Authors: Zachary Roch, Chi Zhang, George Atia, Yue Wang
- Abstract summary: We propose a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our algorithm operates without requiring any prior knowledge of the robust MDP. Our work provides essential theoretical understanding of the sample efficiency of robust average-reward RL.
- Score: 4.530028899565083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its specification. However, a significant gap exists in understanding the finite-sample complexity of these methods, as most existing work provides only asymptotic guarantees. This limitation hinders their principled understanding and practical deployment, especially in data-limited scenarios. We close this gap by proposing \textbf{Robust Halpern Iteration (RHI)}, a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our approach offers three key advantages over previous methods: (1) Weaker Structural Assumptions: RHI only requires the underlying robust MDP to be communicating, a less restrictive condition than the commonly assumed ergodicity or irreducibility; (2) No Prior Knowledge: Our algorithm operates without requiring any prior knowledge of the robust MDP; (3) State-of-the-Art Sample Complexity: To learn an $\epsilon$-optimal robust policy, RHI achieves a sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{\epsilon^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This result represents the tightest known bound. Our work hence provides essential theoretical understanding of the sample efficiency of robust average-reward RL.
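As a rough illustration of the fixed-point machinery behind RHI, here is a minimal sketch of Halpern (anchored) iteration applied to a $\delta$-contamination robust Bellman operator. For readability the sketch uses a discounted operator and an exact empirical kernel, whereas the paper works in the undiscounted average-reward setting from samples and extracts a policy; all function and variable names below are illustrative, not the paper's.

```python
import numpy as np

def robust_bellman_contamination(V, P_hat, r, delta, gamma=0.95):
    """One application of a delta-contamination robust Bellman operator.

    The adversary mixes the nominal kernel with an arbitrary one,
    (1 - delta) * P_hat + delta * q; its best response puts all mass on
    the worst next state, so the inner minimization reduces to min(V).
    """
    Q = r + gamma * ((1.0 - delta) * (P_hat @ V) + delta * V.min())  # (S, A)
    return Q.max(axis=1)

def robust_halpern_iteration(P_hat, r, delta, num_iters=1000):
    """Halpern (anchored) fixed-point iteration:
    V_{k+1} = b_k * V_0 + (1 - b_k) * T(V_k), with anchor b_k = 1/(k+2)."""
    V0 = np.zeros(r.shape[0])
    V = V0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        V = beta * V0 + (1.0 - beta) * robust_bellman_contamination(V, P_hat, r, delta)
    return V

# Toy usage on a random 5-state, 3-action MDP.
rng = np.random.default_rng(0)
P_hat = rng.dirichlet(np.ones(5), size=(5, 3))  # empirical kernel, shape (S, A, S)
r = rng.uniform(size=(5, 3))                    # rewards, shape (S, A)
print(robust_halpern_iteration(P_hat, r, delta=0.1))
```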
Related papers
- Provably Efficient Sample Complexity for Robust CMDP [7.060086147428817]
We study the problem of learning policies that maximize cumulative reward while satisfying safety constraints. We focus on robust constrained Markov decision processes (RCMDPs), where the agent must maximize reward while ensuring cumulative utility exceeds a threshold. We propose a novel Robust Constrained Value Iteration (RCVI) algorithm with a sample complexity of $\tilde{\mathcal{O}}(|S||A|H^{5/2})$, achieving at most $\epsilon$ violations.
arXiv Detail & Related papers (2025-11-10T04:40:37Z) - Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation [5.638124543342179]
We present the first finite-time analysis of robust temporal-difference (TD) learning with linear function approximation. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts.
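As a rough illustration of what a robust TD update with linear features can look like, the sketch below uses a contamination-style pessimistic target; the paper's actual estimator and uncertainty set may differ, and every name and constant here is a placeholder.

```python
import numpy as np

def robust_td_update(theta, phi_s, phi_s_next, reward, v_floor,
                     gamma=0.95, delta=0.1, lr=0.05):
    """One TD(0) step with a contamination-style pessimistic target.

    theta: weights of the linear value function v(s) = phi(s) @ theta;
    v_floor: a running estimate of the minimal state value (the
    contamination adversary's best response). All names illustrative.
    """
    v = phi_s @ theta
    v_next = phi_s_next @ theta
    robust_target = reward + gamma * ((1 - delta) * v_next + delta * v_floor)
    td_error = robust_target - v
    return theta + lr * td_error * phi_s
```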
arXiv Detail & Related papers (2025-10-02T07:01:41Z) - Near-Optimal Sample Complexity Bounds for Constrained Average-Reward MDPs [6.237808815887583]
We study the sample complexity of learning an $\epsilon$-optimal policy in constrained average-reward MDPs under a generative model. Our results close the theoretical gap in understanding the complexity of constrained average-reward MDPs.
arXiv Detail & Related papers (2025-09-20T09:19:42Z) - A Variance-Reduced Cubic-Regularized Newton for Policy Optimization [6.52142708235708]
Existing second-order methods often suffer from suboptimal sample complexity or unrealistic assumptions about importance sampling. To overcome these limitations, we propose VR-CR-PN, a variance-reduced cubic-regularized policy Newton algorithm. As an additional contribution, we introduce a novel estimator for the expected return function, allowing the algorithm to achieve a uniform, horizon-independent sample complexity.
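The cubic-regularized Newton step at the heart of such second-order methods minimizes a third-order model of the objective. Below is a generic sketch of that step with a deliberately simple subproblem solver; VR-CR-PN's variance-reduced gradient and Hessian estimators for the RL return are elided.

```python
import numpy as np

def cubic_newton_step(grad, hess, M, inner_iters=200, inner_lr=0.01):
    """Approximately minimize the cubic model
        m(p) = grad @ p + 0.5 * p @ hess @ p + (M / 6) * ||p||^3
    by gradient descent on p. Note d/dp of (M/6)||p||^3 is (M/2)*||p||*p.
    """
    p = np.zeros_like(grad)
    for _ in range(inner_iters):
        model_grad = grad + hess @ p + 0.5 * M * np.linalg.norm(p) * p
        p -= inner_lr * model_grad
    return p
```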
arXiv Detail & Related papers (2025-07-14T10:04:02Z) - Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees [16.01190705000295]
Constrained decision-making is essential for designing safe policies in real-world control systems. We propose a novel technique that effectively minimizes the constraint value function so as to satisfy the constraints. We prove that the algorithm finds a policy that is feasible and at most $\epsilon$ sub-optimal after $O(\epsilon^{-2})$ iterations.
arXiv Detail & Related papers (2025-05-25T17:27:06Z) - Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z) - Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning [5.8191965840377735]
We propose two algorithms that achieve near-optimal sample complexity. We prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^{2}\varepsilon^{-2}\right)$ for estimating the optimal policy. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning.
arXiv Detail & Related papers (2025-05-15T06:42:25Z) - Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning [33.71515983281633]
We present the first finite-sample analysis for policy evaluation in robust average-reward MDPs. Our approach builds upon Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average reward estimation.
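The MLMC idea can be illustrated on the generic problem of unbiasedly estimating a nonlinear functional $g(\mathbb{E}[X])$, which is exactly the difficulty a robust Bellman operator poses. This is a hedged sketch in the style of randomized multilevel estimators, not the paper's exact construction.

```python
import numpy as np

def mlmc_estimate(sample, g, rng, max_level=20):
    """Randomized multilevel estimator of g(E[X]) for nonlinear g.

    Draw a level N with P(N = k) = 2^{-(k+1)}, use 2^{N+1} samples, and
    reweight the telescoping correction by 1/P(N); truncation at
    max_level is a practical safeguard that slightly biases the result.
    """
    N = min(int(rng.geometric(0.5)) - 1, max_level)  # N in {0, 1, 2, ...}
    p_N = 2.0 ** -(N + 1)
    n = 2 ** (N + 1)
    x = sample(n, rng)
    correction = g(x.mean()) - 0.5 * (g(x[: n // 2].mean()) + g(x[n // 2:].mean()))
    return g(sample(1, rng).mean()) + correction / p_N

# Toy usage: estimate max(0, E[X]) for X ~ N(0.3, 1), a nonlinearity the
# naive plug-in estimator handles with bias.
rng = np.random.default_rng(0)
sample = lambda n, rng: rng.normal(0.3, 1.0, size=n)
estimates = [mlmc_estimate(sample, lambda m: max(0.0, m), rng) for _ in range(20000)]
print(np.mean(estimates))  # close to 0.3
```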
arXiv Detail & Related papers (2025-02-24T03:55:09Z) - Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL [6.996002801232415]
We study the sample complexity of finding an $\varepsilon$-optimal policy in Markov Decision Processes (MDPs) with a generative model. We develop the first algorithms matching the optimal span-based complexity without prior knowledge of $H$.
arXiv Detail & Related papers (2025-02-16T19:10:55Z) - Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits [49.96531901205305]
We analyze $f$-divergence-regularized offline policy learning. For reverse Kullback-Leibler (KL) divergence, we give the first $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability. We extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
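One common way to realize an uncertainty-aware intrinsic reward is ensemble disagreement: reward visiting state-actions where learned models disagree most. The sketch below is a generic stand-in, not GFA-RFE's actual bonus, which is derived from function-class uncertainty.

```python
import numpy as np

def disagreement_bonus(ensemble, s, a, scale=1.0):
    """Intrinsic reward as ensemble disagreement. `ensemble` is a list of
    callables m(s, a) -> value; the standard deviation of their
    predictions serves as a crude uncertainty estimate."""
    preds = np.array([m(s, a) for m in ensemble])
    return scale * preds.std()
```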
arXiv Detail & Related papers (2024-06-24T01:37:18Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
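A simplified sketch of the reference-advantage idea: split the target $r + \gamma V(s')$ into a low-variance reference part, averaged over all visits, and a small advantage part learned at an ordinary rate, freezing the reference once it has "settled". Everything below (step sizes, the settling rule, the discounted setting) is illustrative rather than the paper's tuned episodic algorithm.

```python
import numpy as np

class ReferenceAdvantageQ:
    """Toy reference-advantage Q-learning on a discounted tabular MDP.

    The target r + gamma * V(s') is split as
        r + gamma * [(V(s') - V_ref(s')) + V_ref(s')]:
    the reference term is averaged over all visits (low variance), while
    the small advantage term is learned with an ordinary step size.
    """

    def __init__(self, num_states, num_actions, gamma=0.99, settle_after=100):
        self.Q = np.zeros((num_states, num_actions))
        self.V = np.zeros(num_states)
        self.V_ref = np.zeros(num_states)
        self.visits = np.zeros((num_states, num_actions))
        self.ref_sum = np.zeros((num_states, num_actions))
        self.settled = np.zeros(num_states, dtype=bool)
        self.gamma, self.settle_after = gamma, settle_after

    def update(self, s, a, reward, s_next):
        self.visits[s, a] += 1
        t = self.visits[s, a]
        self.ref_sum[s, a] += self.V_ref[s_next]
        ref_mean = self.ref_sum[s, a] / t              # low-variance reference term
        advantage = self.V[s_next] - self.V_ref[s_next]
        target = reward + self.gamma * (advantage + ref_mean)
        self.Q[s, a] += (1.0 / t) * (target - self.Q[s, a])
        self.V[s] = self.Q[s].max()
        if not self.settled[s] and t >= self.settle_after:  # placeholder rule
            self.V_ref[s] = self.V[s]
            self.settled[s] = True
```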
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
We provide a general sample complexity analysis of the "vanilla" PG method, sharpening its theoretical understanding.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide a small objective residual with high probability.
Existing methods for non-smooth stochastic convex optimization have complexity bounds with sub-optimal dependence on the confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
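The clipping mechanism these methods build on is simple to state: rescale any stochastic gradient whose norm exceeds a threshold before taking the SGD step, which tames heavy-tailed noise. A minimal sketch with placeholder schedules, not the paper's specific stepsize rules:

```python
import numpy as np

def clipped_sgd(stochastic_grad, x0, stepsizes, clip_level, rng):
    """SGD with gradient clipping. `stochastic_grad(x, rng)` returns a
    noisy gradient; gradients longer than clip_level are rescaled."""
    x = np.array(x0, dtype=float)
    for lr in stepsizes:
        g = stochastic_grad(x, rng)
        norm = np.linalg.norm(g)
        if norm > clip_level:
            g *= clip_level / norm
        x -= lr * g
    return x
```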
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
Advantages over state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
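The plain model-based recipe is: query the generative model a fixed number of times per state-action pair, form the empirical MDP, and plan in it with standard value iteration. A minimal sketch, with illustrative names and defaults:

```python
import numpy as np

def empirical_mdp(generative_model, S, A, n_per_sa, rng):
    """Plug-in model: draw n samples per (s, a) from the generative model
    (a callable (s, a, rng) -> next state) and use empirical frequencies."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_per_sa):
                P_hat[s, a, generative_model(s, a, rng)] += 1.0
    return P_hat / n_per_sa

def value_iteration(P_hat, r, gamma=0.9, tol=1e-8):
    """Plan in the empirical MDP with standard value iteration."""
    V = np.zeros(r.shape[0])
    while True:
        Q = r + gamma * (P_hat @ V)           # (S, A)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return Q.argmax(axis=1), V_new    # greedy policy and its values
        V = V_new
```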
arXiv Detail & Related papers (2020-05-26T17:53:18Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with \textit{external uncertainty} in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)