Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
- URL: http://arxiv.org/abs/2506.07040v2
- Date: Mon, 08 Sep 2025 15:34:50 GMT
- Title: Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
- Authors: Yang Xu, Swetha Ganesh, Vaneet Aggarwal,
- Abstract summary: We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for MDPs.<n>A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction.<n>We introduce an actor-critic algorithm that learns an $epsilon$-optimal robust policy within $tildemathcalO(epsilon-2)$ samples.
- Score: 44.223491175769105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
Related papers
- Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning [5.8191965840377735]
We propose two algorithms that achieve near-optimal sample complexity.<n>We prove that both algorithms attain a sample complexity of $widetildeOleft(|mathbfS||mathbfA| t_mathrmmix2varepsilon-2right)$ for estimating the optimal policy.<n>This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning.
arXiv Detail & Related papers (2025-05-15T06:42:25Z) - Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits [49.96531901205305]
We analyze $f$-divergence-regularized offline policy learning.<n>For reverse Kullback-Leibler (KL) divergence, we give the first $tildeO(epsilon-1)$ sample complexity under single-policy concentrability.<n>We extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - Beyond likelihood ratio bias: Nested multi-time-scale stochastic approximation for likelihood-free parameter estimation [49.78792404811239]
We study inference in simulation-based models where the analytical form of the likelihood is unknown.<n>We use a ratio-free nested multi-time-scale approximation (SA) method that simultaneously tracks the score and drives the parameter update.<n>We show that our algorithm can eliminate the original bias $Obig(sqrtfrac1Nbig)$ and accelerate the convergence rate from $Obig(beta_k+sqrtfracalpha_kNbig)$.
arXiv Detail & Related papers (2024-11-20T02:46:15Z) - Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation [18.77565744533582]
Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning.
AC converges to an $epsilon+varepsilon_textcritic$ neighborhood of stationary points with the best known sample complexity.
This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation.
arXiv Detail & Related papers (2024-06-03T20:05:04Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal.<n>This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z) - Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination [42.526664955704746]
We study sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression.
For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees.
At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.
arXiv Detail & Related papers (2024-03-15T15:51:27Z) - Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms [6.304715653196449]
We consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes.<n>We carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d (Markovian) setting.<n>We also show the results of experiments on three different Safety-Gym environments.
arXiv Detail & Related papers (2023-10-25T05:04:00Z) - Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards [27.209606183563853]
We establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism can be provably robustified against heavy-tailed reward distributions.
We show that a robust variant of NAC based on TD learning achieves $tildemathcalO(varepsilon-frac1p)$ sample complexity.
arXiv Detail & Related papers (2023-06-20T11:12:21Z) - Improved Sample Complexity Bounds for Distributionally Robust
Reinforcement Learning [3.222802562733787]
We consider the problem of learning a control policy that is robust against the parameter mismatches between the training environment and testing environment.
We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences.
We show that our algorithm achieves $tildemathcalO(|mathcalSmathcalA| H5)$ sample complexity, which is uniformly better than the existing results.
arXiv Detail & Related papers (2023-03-05T21:47:08Z) - Generalized Differentiable RANSAC [95.95627475224231]
$nabla$-RANSAC is a differentiable RANSAC that allows learning the entire randomized robust estimation pipeline.
$nabla$-RANSAC is superior to the state-of-the-art in terms of accuracy while running at a similar speed to its less accurate alternatives.
arXiv Detail & Related papers (2022-12-26T15:13:13Z) - Robust Methods for High-Dimensional Linear Learning [0.0]
We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting.
We instantiate our framework on several applications including vanilla sparse, group-sparse and low-rank matrix recovery.
For vanilla $s$-sparsity, we are able to reach the $slog (d)/n$ rate under heavy-tails and $eta$-corruption.
arXiv Detail & Related papers (2022-08-10T17:00:41Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $boldsymbolmu in mathbbRn$, where each component $mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-delta$ returns an $epsilon$ accurate estimate of $f(boldsymbolmu)$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs)
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $tilde O(H4d(H + d)epsilon-2)$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $Pi$ that may not contain any near-optimal policy.
We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
arXiv Detail & Related papers (2021-06-22T03:20:40Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = langle X,w* rangle + epsilon$
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning $Q$-function in a sample efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds $epsilon$-Schmidt $Q$-function with sample complexity of $widetildeO(frac1epsilonmax(d1), d_2)+2)$ when the optimal $Q$-function has low rank $r$ and the factor $$ is below a certain threshold.
arXiv Detail & Related papers (2020-06-11T00:55:35Z) - A Finite Time Analysis of Two Time-Scale Actor Critic Methods [87.69128666220016]
We provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting.
We prove that the actor-critic method is guaranteed to find a first-order stationary point.
This is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
arXiv Detail & Related papers (2020-05-04T09:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.