Causal Policy Gradients
- URL: http://arxiv.org/abs/2102.10362v1
- Date: Sat, 20 Feb 2021 14:51:12 GMT
- Title: Causal Policy Gradients
- Authors: Thomas Spooner, Nelson Vadori, Sumitra Ganesh
- Abstract summary: Causal policy gradients (CPGs) provide a common framework for analysing key state-of-the-art algorithms.
CPGs are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes.
- Score: 6.123324869194195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient methods can solve complex tasks but often fail when the
dimensionality of the action space or the objective multiplicity grows very large.
This occurs, in part, because the variance on score-based gradient estimators
scales quadratically with the number of targets. In this paper, we propose a
causal baseline which exploits independence structure encoded in a novel
action-target influence network. Causal policy gradients (CPGs), which follow,
provide a common framework for analysing key state-of-the-art algorithms, are
shown to generalise traditional policy gradients, and yield a principled way of
incorporating prior knowledge of a problem domain's generative processes. We
provide an analysis of the proposed estimator and identify the conditions under
which variance is guaranteed to improve. The algorithmic aspects of CPGs are
also discussed, including optimal policy factorisations, their complexity, and
the use of conditioning to efficiently scale to extremely large, concurrent
tasks. The performance advantages for two variants of the algorithm are
demonstrated on large-scale bandit and concurrent inventory management
problems.
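To make the variance argument above concrete, the following is a minimal, illustrative sketch (not the paper's exact CPG estimator) of a score-function gradient for a factorised Gaussian policy on a toy multi-target bandit. It compares the naive estimator, which credits every action dimension with the total return across all targets, against a masked estimator that credits each action dimension only with the targets it influences. The binary `influence_mask`, the toy reward model, and all names and shapes are assumptions made purely for illustration.

```python
# Illustrative sketch only: per-action credit assignment via a hypothetical
# action-target influence mask, compared with the naive total-return estimator.
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_targets = 4, 6
mu = np.zeros(n_actions)        # mean of the factorised Gaussian policy
sigma = 1.0                     # fixed standard deviation, for simplicity

# influence_mask[i, j] = 1 if action dimension i affects target j (assumed known).
influence_mask = (rng.random((n_actions, n_targets)) < 0.5).astype(float)

def sample_targets(a):
    """Toy generative process: each target depends only on the actions that influence it."""
    return influence_mask.T @ a + rng.normal(size=n_targets)

def naive_gradient(n_samples=1000):
    """Standard score-function estimator: every action dimension is credited with the total return."""
    grads = []
    for _ in range(n_samples):
        a = rng.normal(mu, sigma)
        score = (a - mu) / sigma**2          # gradient of log N(a; mu, sigma) w.r.t. mu
        total_return = sample_targets(a).sum()
        grads.append(score * total_return)
    return np.array(grads)

def masked_gradient(n_samples=1000):
    """Masked estimator: action dimension i is credited only with the targets it influences."""
    grads = []
    for _ in range(n_samples):
        a = rng.normal(mu, sigma)
        score = (a - mu) / sigma**2
        per_target = sample_targets(a)       # one reward per target
        credited = influence_mask @ per_target
        grads.append(score * credited)
    return np.array(grads)

print("naive per-coordinate variance :", naive_gradient().var(axis=0))
print("masked per-coordinate variance:", masked_gradient().var(axis=0))
```

Because a target that is independent of a given action dimension contributes zero in expectation to that dimension's score-function term, dropping such terms leaves the estimate unbiased while removing their contribution to the variance. This is the intuition that the paper formalises through the action-target influence network; the sketch above only demonstrates the effect on a toy problem.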
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Processing Network Controls via Deep Reinforcement Learning [0.0]
This dissertation is concerned with the theoretical justification and practical application of advanced policy gradient (APG) algorithms.
Policy improvement bounds play a crucial role in the theoretical justification of the APG algorithms.
arXiv Detail & Related papers (2022-05-01T04:34:21Z) - Instance-Dependent Confidence and Early Stopping for Reinforcement
Learning [99.57168572237421]
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure.
This research provides guarantees that explain, ex post, the performance differences observed.
A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice.
arXiv Detail & Related papers (2022-01-21T04:25:35Z) - Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement
Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Implementation Matters in Deep Policy Gradients: A Case Study on PPO and
TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations."
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)