Variational Policy Gradient Method for Reinforcement Learning with
General Utilities
- URL: http://arxiv.org/abs/2007.02151v1
- Date: Sat, 4 Jul 2020 17:51:53 GMT
- Title: Variational Policy Gradient Method for Reinforcement Learning with
General Utilities
- Authors: Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and
Mengdi Wang
- Abstract summary: In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction.
In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure.
We derive a new Variational Policy Gradient Theorem for RL with general utilities.
- Score: 38.54243339632217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, reinforcement learning (RL) systems with general goals
beyond a cumulative sum of rewards have gained traction, such as in constrained
problems, exploration, and acting upon prior experiences. In this paper, we
consider policy optimization in Markov Decision Problems, where the objective
is a general concave utility function of the state-action occupancy measure,
which subsumes several of the aforementioned examples as special cases. Such
generality invalidates the Bellman equation. As this means that dynamic
programming no longer works, we focus on direct policy search. Analogously to
the Policy Gradient Theorem \cite{sutton2000policy} available for RL with
cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL
with general utilities, which establishes that the parametrized policy gradient
may be obtained as the solution of a stochastic saddle point problem involving
the Fenchel dual of the utility function. We develop a variational Monte Carlo
gradient estimation algorithm to compute the policy gradient based on sample
paths. We prove that the variational policy gradient scheme converges globally
to the optimal policy for the general objective, though the optimization
problem is nonconvex. We also establish its rate of convergence of the order
$O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it
converges exponentially when the problem admits hidden strong convexity. Our
analysis applies to the standard RL problem with cumulative rewards as a
special case, in which case our result improves the available convergence rate.
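To make the variational recipe concrete, the sketch below is a minimal, illustrative Python example rather than the paper's exact estimator: for a concave utility $F$ of the occupancy measure $\lambda$, the maximizer of the Fenchel-dual problem $\max_z \langle z, \lambda\rangle - F^*(z)$ equals $\nabla F(\lambda)$, and that dual variable plays the role of a per-(state, action) pseudo-reward inside an ordinary REINFORCE update. The toy MDP, the entropy utility, and all hyperparameters are assumptions made purely for illustration.

```python
import numpy as np

# Hedged sketch of the variational policy gradient idea (toy assumptions throughout):
# for a concave utility F of the discounted occupancy measure lambda, the Fenchel-dual
# maximizer argmax_z <z, lambda> - F*(z) equals grad F(lambda); treating that dual
# variable as a pseudo-reward, the gradient of F(lambda^{pi_theta}) can be estimated
# with a REINFORCE-style update (constant factors absorbed into the step size).

rng = np.random.default_rng(0)
S, A, H, gamma = 3, 2, 30, 0.95                  # toy sizes (assumptions)
P = rng.dirichlet(np.ones(S), size=(S, A))       # random transition kernel P[s, a, :]

def F(lam):                                      # example general utility: occupancy entropy
    return -np.sum(lam * np.log(lam + 1e-8))

def grad_F(lam):                                 # = Fenchel-dual maximizer z*(lambda)
    return -(np.log(lam + 1e-8) + 1.0)

def policy(theta, s):                            # tabular softmax policy
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def rollout(theta):
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(A, p=policy(theta, s))
        traj.append((s, a))
        s = rng.choice(S, p=P[s, a])
    return traj

def occupancy(trajs):                            # Monte Carlo estimate of lambda^{pi_theta}
    lam = np.zeros((S, A))
    for traj in trajs:
        for t, (s, a) in enumerate(traj):
            lam[s, a] += (1 - gamma) * gamma**t
    return lam / len(trajs)

theta = np.zeros((S, A))
for _ in range(100):
    trajs = [rollout(theta) for _ in range(50)]
    z = grad_F(occupancy(trajs))                 # dual variable acting as pseudo-reward
    grad = np.zeros_like(theta)
    for traj in trajs:                           # REINFORCE with pseudo-reward z
        for t, (s, a) in enumerate(traj):
            ret = sum(gamma**(k - t) * z[sk, ak]
                      for k, (sk, ak) in enumerate(traj[t:], start=t))
            p = policy(theta, s)
            grad[s] -= gamma**t * ret * p        # d log softmax: -pi(.|s) ...
            grad[s, a] += gamma**t * ret         # ... plus indicator of the taken action
    theta += 0.5 * grad / len(trajs)             # ascent step on the general utility

print("estimated utility:", F(occupancy([rollout(theta) for _ in range(200)])))
```

The paper's actual scheme solves a stochastic saddle point problem over the dual variable from sample paths rather than evaluating $\nabla F$ in closed form; the closed-form substitution here is only to keep the sketch short.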
Related papers
- Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning [0.46040036610482665]
Cumulative Prospect Theory (CPT) has been developed to provide a better model for human-based decision-making supported by empirical evidence.
A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem.
We show that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.
arXiv Detail & Related papers (2024-10-03T15:45:39Z) - Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy at a nonasymptotic rate.
Our result establishes the first complexity bound for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Policy Gradient for Reinforcement Learning with General Utilities [50.65940899590487]
In Reinforcement Learning (RL), the goal of agents is to discover an optimal policy that maximizes the expected cumulative rewards.
However, many supervised and unsupervised RL problems are not covered by this cumulative-reward ("Linear RL") framework.
We derive the policy gradient theorem for RL with general utilities.
arXiv Detail & Related papers (2022-10-03T14:57:46Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
However, a solid theoretical understanding of even the "vanilla" PG has remained limited.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)