Quantile-Based Deep Reinforcement Learning using Two-Timescale Policy
Gradient Algorithms
- URL: http://arxiv.org/abs/2305.07248v1
- Date: Fri, 12 May 2023 04:47:02 GMT
- Title: Quantile-Based Deep Reinforcement Learning using Two-Timescale Policy
Gradient Algorithms
- Authors: Jinyang Jiang, Jiaqiao Hu, and Yijie Peng
- Abstract summary: We parameterize the policy controlling actions by neural networks and propose a novel policy gradient algorithm called Quantile-Based Policy Optimization (QPO).
Our numerical results indicate that the proposed algorithms outperform the existing baseline algorithms under the quantile criterion.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical reinforcement learning (RL) aims to optimize the expected
cumulative reward. In this work, we consider the RL setting where the goal is
to optimize the quantile of the cumulative reward. We parameterize the policy
controlling actions by neural networks, and propose a novel policy gradient
algorithm called Quantile-Based Policy Optimization (QPO) and its variant
Quantile-Based Proximal Policy Optimization (QPPO) for solving deep RL problems
with quantile objectives. QPO uses two coupled iterations running at different
timescales for simultaneously updating quantiles and policy parameters, whereas
QPPO is an off-policy version of QPO that allows multiple updates of parameters
during one simulation episode, leading to improved algorithm efficiency. Our
numerical results indicate that the proposed algorithms outperform the existing
baseline algorithms under the quantile criterion.
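As a rough illustration of the two-timescale structure (a sketch only, not the authors' implementation), the fragment below couples a stochastic-approximation update of the return quantile with a REINFORCE-style score-function update of the policy; the environment interface, step-size schedules, and the assignment of the faster timescale are placeholder assumptions.

    import torch

    # Two-timescale sketch of the quantile-based idea described above.
    # Placeholder interface assumed: env.rollout(policy) returns one episode's
    # return and the summed log-probability of the actions taken, with gradients
    # flowing through the log-probabilities.
    def quantile_policy_sketch(env, policy, alpha=0.1, iters=10_000,
                               beta0=0.5, gamma0=0.05):
        q = 0.0                              # running estimate of the alpha-quantile
        opt = torch.optim.SGD(policy.parameters(), lr=gamma0)
        for k in range(1, iters + 1):
            ret, sum_log_prob = env.rollout(policy)
            beta = beta0 / k ** 0.6          # quantile-tracking step size
            gamma = gamma0 / k               # policy step size (different timescale)
            # stochastic-approximation update toward the alpha-quantile of the return
            q += beta * (alpha - float(ret <= q))
            # score-function surrogate: lowering Pr(return <= q) pushes the quantile up
            loss = float(ret <= q) * sum_log_prob
            for group in opt.param_groups:
                group["lr"] = gamma
            opt.zero_grad()
            loss.backward()
            opt.step()
        return q

The only structural point carried over from the abstract is that the quantile estimate and the policy parameters are updated jointly, each with its own step-size schedule.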
Related papers
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
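One concrete reading of an "optimistic" (anticipatory) update is an extragradient-style lookahead; the fragment below is only a generic sketch under that reading and does not reproduce the paper's meta-gradient adaptation. Here grad_fn is a hypothetical stochastic policy-gradient oracle and params a list of parameter tensors.

    # Extragradient-style "optimistic" update (generic illustration, not the
    # paper's algorithm): take a provisional step, then recompute the gradient
    # at the predicted point and apply it from the original iterate.
    def optimistic_step(params, grad_fn, lr=0.01):
        lookahead = [p + lr * g for p, g in zip(params, grad_fn(params))]
        return [p + lr * g for p, g in zip(params, grad_fn(lookahead))]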
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
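To make the nested-versus-single-loop contrast concrete, here is a schematic sketch in which hypothetical policy_grad and reward_grad estimators stand in for the paper's actual updates: rather than solving the inner RL problem to convergence for every reward iterate, a single-loop method interleaves one step of each.

    # Schematic single-loop IRL: one policy step and one reward step per iteration.
    # theta_pi / theta_r are parameter vectors; policy_grad / reward_grad are
    # hypothetical stochastic gradient estimators.
    def single_loop_irl(theta_pi, theta_r, policy_grad, reward_grad,
                        iters=1000, lr_pi=1e-2, lr_r=1e-3):
        for _ in range(iters):
            # policy-improvement step under the current reward estimate
            theta_pi = theta_pi + lr_pi * policy_grad(theta_pi, theta_r)
            # reward step toward better explaining the expert demonstrations
            theta_r = theta_r + lr_r * reward_grad(theta_r, theta_pi)
        return theta_pi, theta_r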
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Processing Network Controls via Deep Reinforcement Learning [0.0]
This dissertation is concerned with the theoretical justification and practical application of advanced policy gradient (APG) algorithms.
Policy improvement bounds play a crucial role in the theoretical justification of the APG algorithms.
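As background, a representative policy improvement bound of the kind referred to here is the standard TRPO-style lower bound (stated generically, not as the dissertation's specific result):

    J(\pi') \ge J(\pi) + \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}\big[A^{\pi}(s, a)\big] - C \max_{s} D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi'(\cdot \mid s)\big),

where C depends on the discount factor and the largest advantage magnitude. Bounds of this form guarantee that sufficiently small trust-region updates cannot degrade performance by much, which is how they justify APG algorithms.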
arXiv Detail & Related papers (2022-05-01T04:34:21Z) - Quantile-Based Policy Optimization for Reinforcement Learning [0.0]
We parameterize the policy controlling actions by neural networks and propose a novel policy gradient algorithm called Quantile-Based Policy Optimization (QPO).
Our numerical results demonstrate that the proposed algorithms outperform the existing baseline algorithms under the quantile criterion.
arXiv Detail & Related papers (2022-01-27T12:01:36Z) - Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
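A minimal sketch of that one-step recipe, assuming a behavior-policy estimate beta_hat and an on-policy critic q_beta have already been fit from the offline data (all names and hyperparameters here are placeholders, not the paper's implementation):

    import torch

    # One step of KL-regularized policy improvement against the behavior policy.
    def one_step_improvement(policy, beta_hat, q_beta, states,
                             tau=0.1, epochs=50, lr=3e-4):
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        for _ in range(epochs):
            actions = policy.rsample(states)     # reparameterized action samples
            # maximize Q^beta while penalizing divergence from the behavior policy
            kl = policy.log_prob(states, actions) - beta_hat.log_prob(states, actions)
            loss = (-q_beta(states, actions) + tau * kl).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return policy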
arXiv Detail & Related papers (2021-06-16T16:04:26Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
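To see why exactly those three quantities are updated, a schematic form of the variance-constrained problem and its Lagrangian (not the paper's exact objective) is

    \max_{\theta}\; J(\theta) \quad \text{s.t.} \quad \mathrm{Var}_{\theta}(R) \le c,
    \qquad
    L(\theta, \lambda) = J(\theta) - \lambda\Big(\mathbb{E}_{\theta}[R^{2}] - \big(\mathbb{E}_{\theta}[R]\big)^{2} - c\Big).

The squared expectation is the awkward term: applying the Fenchel dual x^{2} = \max_{y}(2xy - y^{2}) replaces it with an inner maximization over a dual variable y, so the algorithm can alternate updates of the policy parameters \theta, the Lagrange multiplier \lambda, and the Fenchel dual variable y.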
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
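The underlying problem is usually written as a constrained MDP; schematically,

    \max_{\pi}\; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \gamma^{t} r(s_t, a_t)\Big]
    \quad \text{s.t.} \quad
    \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \gamma^{t} c_i(s_t, a_t)\Big] \le d_i, \qquad i = 1, \dots, m,

where each c_i is a constraint cost and d_i its budget.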
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Proximal Deterministic Policy Gradient [20.951797549505986]
We introduce two techniques to improve off-policy Reinforcement Learning (RL) algorithms.
We exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate.
We demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
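For reference, the two value functions being exploited are the twin critics popularized by TD3/SAC-style off-policy methods; a minimal illustration of the usual clipped bootstrap target they start from (the paper's improved estimate itself is not reproduced here):

    import torch

    # Standard clipped double-Q bootstrap target built from two critics.
    def clipped_double_q_target(q1, q2, target_policy, next_states,
                                rewards, dones, gamma=0.99):
        with torch.no_grad():
            next_actions = target_policy(next_states)
            q_next = torch.min(q1(next_states, next_actions),
                               q2(next_states, next_actions))
            return rewards + gamma * (1.0 - dones) * q_next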
arXiv Detail & Related papers (2020-08-03T10:19:59Z) - Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
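For context, the relative value function being estimated is the standard average-cost object defined (up to an additive constant) by the Poisson equation; schematically, under a stationary policy,

    V(s) = \mathbb{E}\big[c(s, a) - \eta + V(s')\big],

where \eta is the long-run average cost and s' the next state. The three variance-reduction techniques target the Monte Carlo estimation of this quantity but are not detailed in this summary.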
arXiv Detail & Related papers (2020-07-31T01:02:57Z) - Mirror Descent Policy Optimization [41.46894905097985]
We propose an efficient RL algorithm, called mirror descent policy optimization (MDPO).
MDPO iteratively updates the policy by approximately solving a trust-region problem.
We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms, TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact not a necessity for high performance gains in TRPO.
arXiv Detail & Related papers (2020-05-20T01:30:43Z)
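The per-iteration trust-region problem that MDPO approximately solves has the familiar KL-proximal (mirror descent) form, roughly

    \pi_{k+1} = \arg\max_{\pi}\; \mathbb{E}_{s \sim \rho^{\pi_k}}\Big[\mathbb{E}_{a \sim \pi}\big[A^{\pi_k}(s, a)\big] - \tfrac{1}{\eta_k} D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\big)\Big],

so the KL term plays the role of a soft trust region rather than an explicit constraint as in TRPO.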
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.