Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement
Learning: A DRO-Based Approach
- URL: http://arxiv.org/abs/2305.13289v3
- Date: Sun, 3 Dec 2023 23:00:11 GMT
- Title: Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement
Learning: A DRO-Based Approach
- Authors: Yue Wang, Jinjun Xiong, Shaofeng Zou
- Abstract summary: offline reinforcement learning aims to learn from pre-collected datasets without active exploration.
Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs.
We show that the distributionally robust optimization approach can also address these challenges and is minimax optimal.
- Score: 41.45280954901834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning aims to learn from pre-collected datasets
without active exploration. This problem faces significant challenges,
including limited data availability and distributional shifts. Existing
approaches adopt a pessimistic stance towards uncertainty by penalizing rewards
of under-explored state-action pairs to estimate value functions
conservatively. In this paper, we show that the distributionally robust
optimization (DRO) based approach can also address these challenges and is
minimax optimal. Specifically, we directly model the uncertainty in the
transition kernel and construct an uncertainty set of statistically plausible
transition kernels. We then find the policy that optimizes the worst-case
performance over this uncertainty set. We first design a metric-based
Hoeffding-style uncertainty set such that with high probability the true
transition kernel is in this set. We prove that to achieve a sub-optimality gap
of $\epsilon$, the sample complexity is
$\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the
discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the
single-policy clipped concentrability coefficient which quantifies the
distribution shift. To achieve the optimal sample complexity, we further
propose a less conservative Bernstein-style uncertainty set, which, however,
does not necessarily include the true transition kernel. We show that an
improved sample complexity of
$\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which
matches the minimax lower bound for offline reinforcement learning, and
thus is minimax optimal.
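To make the approach concrete, here is a minimal sketch of the DRO pipeline described above: estimate empirical transition kernels from dataset counts, give each $(s,a)$ pair a Hoeffding-style radius that shrinks with its visit count, and run robust value iteration whose Bellman backup uses the worst-case kernel in a total-variation ball. The TV geometry, the constant in the radius, and all names are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def worst_case_expectation(p_hat, v, radius):
    """Minimize q @ v over {q on the simplex : ||q - p_hat||_1 <= radius}.

    Closed form for a TV ball: shift mass from the highest-value states
    onto the single lowest-value state.
    """
    q = p_hat.copy()
    budget = radius / 2.0                  # an L1 radius moves radius/2 mass
    lo = np.argmin(v)
    for s in np.argsort(v)[::-1]:          # donate from high-value states
        if s == lo:
            continue
        take = min(q[s], budget)
        q[s] -= take
        q[lo] += take
        budget -= take
        if budget <= 0:
            break
    return q @ v

def robust_value_iteration(counts, rewards, gamma, delta, iters=500):
    """counts: (S, A, S) transition counts from the offline dataset;
    rewards: (S, A) with entries in [0, 1]; delta: failure probability.
    The radius sqrt(2 log(SA/delta) / n(s,a)) is a Hoeffding-style
    stand-in, not the paper's exact constant.
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)
    p_hat = np.where(n[..., None] > 0,
                     counts / np.maximum(n[..., None], 1), 1.0 / S)
    radius = np.minimum(2.0, np.sqrt(2.0 * np.log(S * A / delta)
                                     / np.maximum(n, 1)))
    v = np.zeros(S)
    q = np.zeros((S, A))
    for _ in range(iters):
        for s in range(S):
            for a in range(A):
                q[s, a] = rewards[s, a] + gamma * worst_case_expectation(
                    p_hat[s, a], v, radius[s, a])
        v = q.max(axis=1)
    return q.argmax(axis=1), v             # greedy policy, robust values
```

The paper's less conservative Bernstein-style set would swap in a variance-aware radius (and, as noted above, need not contain the true kernel), but the fixed-point iteration itself is unchanged.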
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning [3.222802562733787]
We consider the problem of learning a control policy that is robust against parameter mismatches between the training environment and the testing environment.
We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences.
We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}\mathcal{A}| H^{5})$ sample complexity, which is uniformly better than the existing results.
arXiv Detail & Related papers (2023-03-05T21:47:08Z)
- A Finite Sample Complexity Bound for Distributionally Robust Q-learning [16.88062487980405]
We consider a reinforcement learning setting in which the deployment environment is different from the training environment.
Applying a robust Markov decision process formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al.
This is the first sample complexity result for the model-free robust RL problem.
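For intuition, below is a minimal sketch of one robust Q-learning step with a KL uncertainty set, evaluating the dual form $\sup_{\beta>0}\{-\beta\log\mathbb{E}_p[e^{-V/\beta}]-\beta\rho\}$ on sampled next states. The plain plug-in average inside the exponential is biased, which is one reason the original framework employs a more careful multilevel Monte Carlo estimator; the bounds on $\beta$ and all names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_worst_case(v_samples, rho):
    """Dual of min_{KL(q||p) <= rho} E_q[v], estimated from samples of v."""
    def neg_dual(beta):
        m = np.max(-v_samples / beta)                # log-sum-exp shift
        lse = m + np.log(np.mean(np.exp(-v_samples / beta - m)))
        return beta * lse + beta * rho               # minus the dual value
    res = minimize_scalar(neg_dual, bounds=(1e-3, 1e3), method="bounded")
    return -res.fun

def robust_q_update(Q, s, a, r, next_states, gamma, rho, lr):
    """One tabular robust Q-learning step from sampled next states."""
    v = Q.max(axis=1)                                # greedy value estimates
    target = r + gamma * kl_worst_case(v[next_states], rho)
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```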
arXiv Detail & Related papers (2023-02-26T01:15:32Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as a starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions.
We generalize the PAGE algorithm so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of the optimal bound in the smooth nonconvex regime.
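If the method name lost inside the color macro above is indeed the PAGE estimator (an inference from the paper's setting, not something the garbled summary confirms), a minimal PAGE loop looks as follows; `full_grad` and `pair_grad` are hypothetical user-supplied oracles, and `pair_grad` must evaluate both points on the same minibatch for the correction to be variance-reducing.

```python
import numpy as np

def page(full_grad, pair_grad, x0, lr, p, n_iters, seed=0):
    """Minimal PAGE loop.

    full_grad(x): gradient on a large batch (or the full dataset).
    pair_grad(x_new, x_old): gradients at both points on the SAME
    small minibatch, returned as a (g_new, g_old) pair.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    g = full_grad(x)                       # initialize with a full gradient
    for _ in range(n_iters):
        x_new = x - lr * g
        if rng.random() < p:
            g = full_grad(x_new)           # occasional full refresh
        else:
            g_new, g_old = pair_grad(x_new, x)
            g = g + g_new - g_old          # variance-reduced correction
        x = x_new
    return x
```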
arXiv Detail & Related papers (2022-06-05T21:32:33Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
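As a rough illustration of the model-based ("plug-in") approach, the sketch below builds the empirical MDP from counts and runs value iteration with a count-based pessimism penalty. The simple Hoeffding-style bonus $c/\sqrt{n(s,a)}$ stands in for the paper's sharper variance-aware penalties; all names and constants are illustrative.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, gamma, c, iters=500):
    """Plug-in MDP plus a lower-confidence-bound penalty on each (s, a).

    counts: (S, A, S) transition counts; rewards: (S, A) in [0, 1].
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)
    p_hat = np.where(n[..., None] > 0,
                     counts / np.maximum(n[..., None], 1), 1.0 / S)
    bonus = c / np.sqrt(np.maximum(n, 1))
    v = np.zeros(S)
    for _ in range(iters):
        q = rewards - bonus + gamma * p_hat @ v   # penalized backup
        q = np.clip(q, 0.0, 1.0 / (1.0 - gamma))  # keep values in range
        v = q.max(axis=1)
    return q.argmax(axis=1), v                    # greedy policy, values
```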
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Non-convex Distributionally Robust Optimization: Non-asymptotic Analysis [16.499651513178012]
Distributionally robust optimization (DRO) is a widely-used approach to learn models that are robust against distribution shift.
We provide non-asymptotic convergence guarantees for normalized gradient descent even though the objective function is possibly non-convex and non-smooth.
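A minimal sketch of the kind of method analyzed in that line of work, mini-batch normalized SGD with momentum (the `stoch_grad` oracle and every hyperparameter are illustrative assumptions):

```python
import numpy as np

def normalized_sgd_momentum(stoch_grad, x0, lr, beta, n_iters):
    """Normalized SGD with momentum: step along the momentum buffer
    divided by its norm, which copes with the non-smoothness that
    DRO objectives typically exhibit."""
    x = x0.copy()
    m = np.zeros_like(x)
    for _ in range(n_iters):
        g = stoch_grad(x)                          # mini-batch gradient
        m = beta * m + (1.0 - beta) * g            # momentum averaging
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)
    return x
```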
arXiv Detail & Related papers (2021-10-24T14:56:38Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)