A Study of Policy Gradient on a Class of Exactly Solvable Models
- URL: http://arxiv.org/abs/2011.01859v1
- Date: Tue, 3 Nov 2020 17:27:53 GMT
- Title: A Study of Policy Gradient on a Class of Exactly Solvable Models
- Authors: Gavin McCracken, Colin Daniels, Rosie Zhao, Anna Brandenberger,
Prakash Panangaden, Doina Precup
- Abstract summary: We explore the evolution of the policy parameters, for a special class of exactly solvable POMDPs, as a continuous-state Markov chain.
Our approach relies heavily on random walk theory, specifically on affine Weyl groups.
We analyze the probabilistic convergence of policy gradient to different local maxima of the value function.
- Score: 35.90565839381652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient methods are extensively used in reinforcement learning as a
way to optimize expected return. In this paper, we explore the evolution of the
policy parameters, for a special class of exactly solvable POMDPs, as a
continuous-state Markov chain, whose transition probabilities are determined by
the gradient of the distribution of the policy's value. Our approach relies
heavily on random walk theory, specifically on affine Weyl groups. We construct
a class of novel partially observable environments with controllable
exploration difficulty, in which the value distribution, and hence the policy
parameter evolution, can be derived analytically. Using these environments, we
analyze the probabilistic convergence of policy gradient to different local
maxima of the value function. To our knowledge, this is the first approach
developed to analytically compute the landscape of policy gradient in POMDPs
for a class of such environments, leading to interesting insights into the
difficulty of this problem.
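To make the "policy parameters as a continuous-state Markov chain" framing concrete, here is a minimal sketch. It is not one of the paper's environments. It assumes a two-armed bandit with fixed rewards, a one-parameter sigmoid policy, and a plain REINFORCE update. The names `reinforce_step`, `run_chain`, the reward values, and the learning rate are illustrative assumptions. The next parameter value depends only on the current one and on the sampled action, so the sequence of parameter values forms a Markov chain on the real line.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reinforce_step(theta, rewards, lr, rng):
    """One REINFORCE update for a two-armed bandit (illustrative assumption).

    The policy picks arm 1 with probability sigmoid(theta), else arm 0.
    Given theta, the next value is one of two points, chosen at random,
    so theta evolves as a Markov chain on a continuous state space.
    """
    p = sigmoid(theta)
    action = 1 if rng.random() < p else 0
    # d/dtheta log pi(action): (1 - p) for arm 1, -p for arm 0.
    grad_log_pi = (1.0 - p) if action == 1 else -p
    return theta + lr * rewards[action] * grad_log_pi


def run_chain(theta0, rewards, lr, steps, seed):
    """Simulate the parameter trajectory from theta0 for a fixed seed."""
    rng = random.Random(seed)
    theta = theta0
    path = [theta]
    for _ in range(steps):
        theta = reinforce_step(theta, rewards, lr, rng)
        path.append(theta)
    return path


if __name__ == "__main__":
    # Hypothetical rewards: arm 1 is better, but arm 0 still pays off.
    rewards = (1.0, 2.0)
    finals = [run_chain(0.0, rewards, 0.1, 200, seed)[-1] for seed in range(20)]
    favour_arm1 = sum(1 for t in finals if t > 0.0)
    print(f"{favour_arm1} of {len(finals)} runs end with theta > 0")
```

Running many seeds gives an empirical picture of where the chain tends to end up. The paper instead derives this kind of behaviour analytically for its own class of environments.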
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Gradient Informed Proximal Policy Optimization [35.22712034665224]
We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm.
By adaptively modifying the alpha value, we can effectively manage the influence of analytical policy gradients during learning.
Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments.
arXiv Detail & Related papers (2023-12-14T07:50:21Z) - Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks.
Specifically, when the stationary distribution of the MDP is parametrized by policy parameters, we can improve existing policy gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z) - A Large Deviations Perspective on Policy Gradient Algorithms [6.075593833879357]
Motivated by policy gradient methods, we identify a large deviation rate function for the iterates generated by gradient methods.
We show how this phenomenon can be naturally extended to a wide spectrum of other policy parametrizations.
arXiv Detail & Related papers (2023-11-13T15:44:27Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - A Policy Gradient Method for Confounded POMDPs [7.75007282943125]
We propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting.
We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data.
arXiv Detail & Related papers (2023-05-26T16:48:05Z) - Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z) - MPC-based Reinforcement Learning for Economic Problems with Application
to Battery Storage [0.0]
We focus on policy approximations based on Model Predictive Control (MPC).
We observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters when the policy has a (nearly) bang-bang structure.
We propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during the learning.
arXiv Detail & Related papers (2021-04-06T10:37:14Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z) - Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.