Reinforcement Learning with General Utilities: Simpler Variance
Reduction and Large State-Action Space
- URL: http://arxiv.org/abs/2306.01854v1
- Date: Fri, 2 Jun 2023 18:16:35 GMT
- Title: Reinforcement Learning with General Utilities: Simpler Variance
Reduction and Large State-Action Space
- Authors: Anas Barakat, Ilyas Fatkhullin, Niao He
- Abstract summary: We consider the reinforcement learning problem with general utilities.
Our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities.
- Score: 17.366915676628867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the reinforcement learning (RL) problem with general utilities
which consists in maximizing a function of the state-action occupancy measure.
Beyond the standard cumulative reward RL setting, this problem includes as
particular cases constrained RL, pure exploration, and learning from
demonstrations, among others. For this problem, we propose a simpler single-loop
parameter-free normalized policy gradient algorithm. Implementing a recursive
momentum variance reduction mechanism, our algorithm achieves
$\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$
sample complexities for $\epsilon$-first-order stationarity and
$\epsilon$-global optimality respectively, under adequate assumptions. We
further address the setting of large finite state-action spaces via linear
function approximation of the occupancy measure and show a
$\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity for a simple policy
gradient method with a linear regression subroutine.
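For intuition, the following is a minimal sketch of the kind of update the abstract describes: a single-loop normalized policy gradient step combined with a recursive momentum (STORM-style) variance-reduced gradient estimate. It is not the authors' implementation; the interfaces `sample_batch` and `utility_gradient`, and the constants `eta` and `beta`, are illustrative assumptions rather than quantities specified in the paper.

```python
import numpy as np

def variance_reduced_npg(sample_batch, utility_gradient, theta0,
                         num_iters=1000, eta=0.05, beta=0.2):
    """Sketch of a single-loop normalized policy gradient method with a
    recursive momentum (STORM-style) variance-reduced gradient estimate.

    Assumed (hypothetical) interfaces:
      sample_batch(theta)            -> trajectories collected under policy theta
      utility_gradient(theta, batch) -> stochastic gradient of the general
                                        utility at theta from that batch
    The step size eta and momentum weight beta are placeholders, not the
    schedules analyzed in the paper.
    """
    theta = np.asarray(theta0, dtype=float)
    batch = sample_batch(theta)
    d = utility_gradient(theta, batch)              # initial gradient estimate
    for _ in range(num_iters):
        # Normalized step: only the direction of d matters, which is what
        # makes the update insensitive to the gradient scale ("parameter-free").
        theta_new = theta + eta * d / (np.linalg.norm(d) + 1e-12)
        # Recursive momentum correction: a fresh batch is used to evaluate the
        # gradient at both the new and the previous parameters, and their
        # difference corrects the running estimate d. (Evaluating the old
        # parameters on a batch collected under the new policy may require
        # off-policy corrections in practice; this sketch omits them.)
        batch = sample_batch(theta_new)
        d = utility_gradient(theta_new, batch) \
            + (1.0 - beta) * (d - utility_gradient(theta, batch))
        theta = theta_new
    return theta
```

Setting `beta = 1` in this sketch recovers a plain stochastic normalized policy gradient step; `beta < 1` recursively reuses past estimates, which is the variance reduction mechanism behind the improved sample complexities stated above.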
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Demonstration-Regularized RL [39.96273388393764]
Using expert demonstrations, we identify an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes.
We establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-10-26T10:54:47Z) - Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z) - Stochastic Policy Gradient Methods: Improved Sample Complexity for
Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient method.
arXiv Detail & Related papers (2023-02-03T13:50:23Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization [53.044526424637866]
In this paper, we consider finding an approximate second-order stationary point (SOSP) of a general nonconvex conic optimization problem that minimizes a twice differentiable function subject to conic constraints.
In particular, we propose a Newton-CG based barrier-augmented Lagrangian method for finding an approximate SOSP.
arXiv Detail & Related papers (2023-01-10T20:43:29Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Policy Mirror Descent for Reinforcement Learning: Linear Convergence,
New Sampling Complexity, and Generalized Problem Classes [6.240369435223]
We present new policy mirror descent (PMD) methods for solving reinforcement learning problems with strongly convex or general convex regularizers.
To the best of our knowledge, these developments appear to be new in both the optimization and reinforcement learning literature, covering a flexible class of convex regularizers.
arXiv Detail & Related papers (2021-01-30T02:30:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.