Approximation Benefits of Policy Gradient Methods with Aggregated States
- URL: http://arxiv.org/abs/2007.11684v3
- Date: Thu, 23 Jun 2022 15:39:09 GMT
- Title: Approximation Benefits of Policy Gradient Methods with Aggregated States
- Authors: Daniel Russo
- Abstract summary: Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration.
This paper shows that a policy gradient method converges to a policy whose per-period regret is bounded by $\epsilon$.
- Score: 8.348171150908724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Folklore suggests that policy gradient can be more robust to misspecification
than its relative, approximate policy iteration. This paper studies the case of
state-aggregated representations, where the state space is partitioned and
either the policy or value function approximation is held constant over
partitions. This paper shows a policy gradient method converges to a policy
whose regret per-period is bounded by $\epsilon$, the largest difference
between two elements of the state-action value function belonging to a common
partition. With the same representation, both approximate policy iteration and
approximate value iteration can produce policies whose per-period regret scales
as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Faced with
inherent approximation error, methods that locally optimize the true
decision-objective can be far more robust.
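To make the setting concrete, the sketch below runs exact softmax policy gradient ascent with state-aggregated parameters on a toy tabular MDP: the policy is forced to be constant over each partition, and gradients of the true discounted objective are summed within partitions. This is an illustrative sandbox, not the paper's algorithm or analysis; the random MDP, the aggregation map `phi`, the step size, and the iteration count are all assumptions chosen for the example.

```python
# Minimal sketch (assumptions only, not the paper's method): exact softmax
# policy gradient with parameters shared across aggregated states.
import numpy as np

n_states, n_actions, n_groups, gamma = 6, 2, 3, 0.9
rng = np.random.default_rng(0)

# Random toy MDP: P[s, a, s'] transition probabilities and r(s, a) rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))
phi = np.array([0, 0, 1, 1, 2, 2])        # state -> partition (aggregation map)
theta = np.zeros((n_groups, n_actions))   # one softmax parameter row per partition

def policy(theta):
    """pi(a|s) = softmax(theta[phi(s)]): constant over each partition."""
    logits = theta[phi]                                   # (n_states, n_actions)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def q_and_occupancy(pi, rho):
    """Exact Q^pi and discounted state occupancy d^pi for the toy MDP."""
    P_pi = np.einsum('sap,sa->sp', P, pi)                 # transitions under pi
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    q = R + gamma * P @ v                                 # Q^pi(s, a)
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, (1 - gamma) * rho)
    return q, d

rho = np.full(n_states, 1.0 / n_states)                   # initial-state distribution
for _ in range(2000):                                     # exact policy-gradient ascent
    pi = policy(theta)
    q, d = q_and_occupancy(pi, rho)
    adv = q - (pi * q).sum(axis=1, keepdims=True)         # advantage A^pi(s, a)
    per_state = d[:, None] * pi * adv / (1 - gamma)       # softmax PG term per state
    grad = np.zeros_like(theta)
    np.add.at(grad, phi, per_state)                       # sum gradients within each partition
    theta += 0.5 * grad

pi = policy(theta)
q, _ = q_and_occupancy(pi, rho)
print("value of the aggregated policy from rho:", (pi * q).sum(axis=1) @ rho)
```

The point of the sketch is only that the gradient step optimizes the true decision objective under the coarse parameterization, which is the mechanism behind the $\epsilon$ versus $\epsilon/(1-\gamma)$ contrast in the abstract.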
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation [11.345796608258434]
We show that a first-visit version of such a policy iteration scheme converges to the optimal policy provided that the policy improvement step uses lookahead.
We also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
arXiv Detail & Related papers (2023-01-23T20:32:41Z)
- Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
First, we show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
- The Role of Lookahead and Approximate Policy Evaluation in Policy Iteration with Linear Value Function Approximation [14.528756508275622]
We show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed.
When this condition is met, we characterize the finite-time performance of policies obtained using such an approximate policy iteration scheme.
arXiv Detail & Related papers (2021-09-28T01:20:08Z)
- Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
- On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE (a minimal Monte Carlo sketch of which appears after this list).
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)
- Inverse Policy Evaluation for Value-based Sequential Decision-making [10.188967035477217]
Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function.
We show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.
arXiv Detail & Related papers (2020-08-26T01:31:38Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
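Several of the entries above (the softmax policy gradient convergence-time result and the variance-reduced policy gradient sample-complexity result) concern the stochastic, sampled-gradient regime rather than the exact-gradient regime of the first sketch. The sketch below is a minimal Monte Carlo REINFORCE update on a toy tabular MDP, included only to make that contrast concrete; the random MDP, horizon, step size, and trajectory count are illustrative assumptions, and this is not the algorithm analyzed in any of the listed papers.

```python
# Minimal REINFORCE sketch (assumptions only): one sampled trajectory per update,
# in contrast to the exact-gradient aggregation example above.
import numpy as np

n_states, n_actions, gamma, horizon = 4, 3, 0.9, 30
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # r(s, a)
theta = np.zeros((n_states, n_actions))   # tabular softmax parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):                      # one sampled trajectory per update
    s = rng.integers(n_states)
    rewards, log_grads = [], []
    for _ in range(horizon):
        pi_s = softmax(theta[s])
        a = rng.choice(n_actions, p=pi_s)
        g = -pi_s
        g[a] += 1.0                       # grad of log pi(a|s) wrt theta[s, :]
        log_grads.append((s, g))
        rewards.append(R[s, a])
        s = rng.choice(n_states, p=P[s, a])
    ret, returns = 0.0, np.zeros(horizon)
    for t in reversed(range(horizon)):    # discounted returns-to-go G_t
        ret = rewards[t] + gamma * ret
        returns[t] = ret
    grad = np.zeros_like(theta)
    for t, (s_t, g) in enumerate(log_grads):
        grad[s_t] += (gamma ** t) * returns[t] * g   # REINFORCE gradient estimate
    theta += 0.1 * grad
```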