TOPS: Transition-based VOlatility-controlled Policy Search and its
Global Convergence
- URL: http://arxiv.org/abs/2201.09857v1
- Date: Mon, 24 Jan 2022 18:29:23 GMT
- Title: TOPS: Transition-based VOlatility-controlled Policy Search and its
Global Convergence
- Authors: Liangliang Xu, Aiwen Jiang, Daoming Lyu, Bo Liu
- Abstract summary: This paper proposes Transition-based VOlatility-controlled Policy Search (TOPS)
It is a novel algorithm that solves risk-averse problems by learning from (possibly non-consecutive) transitions instead of only consecutive trajectories.
Both theoretical analysis and experimental results show that TOPS performs at the state-of-the-art level among risk-averse policy search methods.
- Score: 9.607937067646617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Risk-averse problems receive far less attention than risk-neutral control
problems in reinforcement learning, and existing risk-averse approaches are
difficult to deploy in real-world applications. One primary reason is that
such risk-averse algorithms often learn from consecutive trajectories of a
certain length, which significantly increases the risk of dangerous failures
in practice. This paper proposes Transition-based
VOlatility-controlled Policy Search (TOPS), a novel algorithm that solves
risk-averse problems by learning from (possibly non-consecutive) transitions
instead of only consecutive trajectories. By using an actor-critic scheme with
an overparameterized two-layer neural network, our algorithm finds a globally
optimal policy at a sublinear rate with proximal policy optimization and
natural policy gradient, matching the state-of-the-art convergence rate of
risk-neutral policy-search methods. The algorithm is evaluated on challenging
MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both
theoretical analysis and experimental results demonstrate that TOPS achieves
state-of-the-art performance among risk-averse policy search methods.
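As a rough illustration of the transition-based idea, the sketch below performs a mean-variance-penalized actor-critic update on a random batch of (state, action, reward, next-state) transitions drawn from a replay buffer, rather than on full consecutive trajectories. It is a minimal sketch and not the TOPS algorithm itself: the penalty coefficient lam, the network width, the discrete policy, and the plain Adam step are assumptions made only for illustration, whereas the paper's analysis relies on proximal policy optimization and natural policy gradient updates with an overparameterized two-layer network.

```python
# Minimal sketch (not the exact TOPS update): learning from individual
# transitions under a mean-variance penalty. The penalty coefficient lam,
# the network width, the discrete policy, and the plain Adam step are
# illustrative assumptions, not the construction analyzed in the paper.
import random
import torch
import torch.nn as nn


class TwoLayerNet(nn.Module):
    """Over-parameterized two-layer network (same shape for actor and critic)."""

    def __init__(self, in_dim, out_dim, width=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                                 nn.Linear(width, out_dim))

    def forward(self, x):
        return self.net(x)


def update_step(actor, critic, optimizer, buffer,
                gamma=0.99, lam=0.5, batch_size=64):
    """One actor-critic update from a batch of (s, a, r, s') transitions.

    buffer is a list of tuples of tensors; a holds integer action indices
    (a discrete policy is assumed here for brevity). The sampled transitions
    need not be consecutive, which is the point being illustrated.
    """
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next = (torch.stack(x) for x in zip(*batch))
    # Mean-variance-penalized reward: r - lam * (r - mean(r))^2, estimated
    # on the sampled batch.
    r_adj = r - lam * (r - r.mean()).pow(2)
    with torch.no_grad():
        target = r_adj + gamma * critic(s_next).squeeze(-1)
    value = critic(s).squeeze(-1)
    critic_loss = (value - target).pow(2).mean()
    # Policy-gradient-style actor loss weighted by the TD advantage.
    advantage = (target - value).detach()
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(advantage * log_prob).mean()
    optimizer.zero_grad()
    (critic_loss + actor_loss).backward()
    optimizer.step()
    return critic_loss.item(), actor_loss.item()
```

A hypothetical usage would build actor = TwoLayerNet(obs_dim, n_actions), critic = TwoLayerNet(obs_dim, 1), and a single Adam optimizer over both parameter sets, then call update_step once per environment step; because the batch is sampled transition by transition, no consecutive trajectory of a fixed length ever has to be collected to form an update.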
Related papers
- Risk-averse learning with delayed feedback [17.626195546400247]
We develop two risk-averse learning algorithms that rely on one-point and two-point zeroth-order optimization approaches.
The results suggest that the two-point risk-averse algorithm achieves a smaller regret bound than the one-point algorithm (a sketch of both estimators follows this list).
arXiv Detail & Related papers (2024-09-25T12:32:22Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War [0.0]
We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War.
This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making.
arXiv Detail & Related papers (2023-07-17T20:59:50Z)
- High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z)
- Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
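As referenced in the "Risk-averse learning with delayed feedback" entry above, the sketch below contrasts one-point and two-point zeroth-order gradient estimators. The quadratic objective f, the smoothing radius delta, and the Gaussian directions are illustrative assumptions rather than that paper's exact construction; the sketch only shows why the two-point estimator tends to have much lower variance.

```python
# Illustrative sketch: one-point vs. two-point zeroth-order gradient
# estimators for a black-box objective. The objective f, the smoothing
# radius delta, and the Gaussian directions are assumptions made for
# illustration, not the construction used in the cited paper.
import numpy as np


def f(x):
    # Hypothetical black-box objective: a simple quadratic bowl.
    return 0.5 * np.sum(x ** 2)


def one_point_estimate(x, delta, rng):
    # One-point estimator: a single function query per random direction.
    u = rng.standard_normal(x.shape)
    return (f(x + delta * u) / delta) * u


def two_point_estimate(x, delta, rng):
    # Two-point estimator: a finite difference along the same direction,
    # which cancels the f(x)/delta term that inflates the one-point variance.
    u = rng.standard_normal(x.shape)
    return ((f(x + delta * u) - f(x - delta * u)) / (2.0 * delta)) * u


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.ones(5)
    g1 = np.stack([one_point_estimate(x, 0.1, rng) for _ in range(1000)])
    g2 = np.stack([two_point_estimate(x, 0.1, rng) for _ in range(1000)])
    print("one-point estimator variance:", g1.var(axis=0).mean())
    print("two-point estimator variance:", g2.var(axis=0).mean())
```

Both estimators target the gradient of a Gaussian-smoothed version of f; the two-point version's much smaller empirical variance is consistent with the smaller regret bound reported in the summary above.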
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.