TOPS: Transition-based VOlatility-controlled Policy Search and its
Global Convergence
- URL: http://arxiv.org/abs/2201.09857v1
- Date: Mon, 24 Jan 2022 18:29:23 GMT
- Title: TOPS: Transition-based VOlatility-controlled Policy Search and its
Global Convergence
- Authors: Liangliang Xu, Aiwen Jiang, Daoming Lyu, Bo Liu
- Abstract summary: This paper proposes Transition-based VOlatility-controlled Policy Search (TOPS)
It is a novel algorithm that solves risk-averse problems by learning from (possibly non-consecutive) transitions instead of only consecutive trajectories.
Both theoretical analysis and experimental results show that TOPS performs at the state-of-the-art level among risk-averse policy search methods.
- Score: 9.607937067646617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Risk-averse problems receive far less attention than risk-neutral control
problems in reinforcement learning, and existing risk-averse approaches are
difficult to deploy in real-world applications. One primary reason is that
such risk-averse algorithms often learn from consecutive trajectories of a
certain length, which significantly increases the risk of dangerous failures
in practice. This paper proposes Transition-based
VOlatility-controlled Policy Search (TOPS), a novel algorithm that solves
risk-averse problems by learning from (possibly non-consecutive) transitions
instead of only consecutive trajectories. By using an actor-critic scheme with
an overparameterized two-layer neural network, our algorithm finds a globally
optimal policy at a sublinear rate with proximal policy optimization and
natural policy gradient, matching the state-of-the-art convergence rate of
risk-neutral policy-search methods. The algorithm is evaluated on challenging
MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both
theoretical analysis and experimental results demonstrate that TOPS achieves
state-of-the-art performance among risk-averse policy search methods.
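As a rough illustration of the transition-based idea, the sketch below performs a mean-variance-penalized actor-critic update on a random batch of (state, action, reward, next-state) transitions drawn from a replay buffer, rather than on full consecutive trajectories. It is a minimal sketch and not the TOPS algorithm itself: the penalty coefficient lam, the network width, the discrete policy, and the plain Adam step are assumptions made only for illustration, whereas the paper's analysis relies on proximal policy optimization and natural policy gradient updates with an overparameterized two-layer network.

```python
# Minimal sketch (not the exact TOPS update): learning from individual
# transitions under a mean-variance penalty. The penalty coefficient lam,
# the network width, the discrete policy, and the plain Adam step are
# illustrative assumptions, not the construction analyzed in the paper.
import random
import torch
import torch.nn as nn


class TwoLayerNet(nn.Module):
    """Over-parameterized two-layer network (same shape for actor and critic)."""

    def __init__(self, in_dim, out_dim, width=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                                 nn.Linear(width, out_dim))

    def forward(self, x):
        return self.net(x)


def update_step(actor, critic, optimizer, buffer,
                gamma=0.99, lam=0.5, batch_size=64):
    """One actor-critic update from a batch of (s, a, r, s') transitions.

    buffer is a list of tuples of tensors; a holds integer action indices
    (a discrete policy is assumed here for brevity). The sampled transitions
    need not be consecutive, which is the point being illustrated.
    """
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next = (torch.stack(x) for x in zip(*batch))
    # Mean-variance-penalized reward: r - lam * (r - mean(r))^2, estimated
    # on the sampled batch.
    r_adj = r - lam * (r - r.mean()).pow(2)
    with torch.no_grad():
        target = r_adj + gamma * critic(s_next).squeeze(-1)
    value = critic(s).squeeze(-1)
    critic_loss = (value - target).pow(2).mean()
    # Policy-gradient-style actor loss weighted by the TD advantage.
    advantage = (target - value).detach()
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(advantage * log_prob).mean()
    optimizer.zero_grad()
    (critic_loss + actor_loss).backward()
    optimizer.step()
    return critic_loss.item(), actor_loss.item()
```

A hypothetical usage would build actor = TwoLayerNet(obs_dim, n_actions), critic = TwoLayerNet(obs_dim, 1), and a single Adam optimizer over both parameter sets, then call update_step once per environment step; because the batch is sampled transition by transition, no consecutive trajectory of a fixed length ever has to be collected to form an update.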
Related papers
- Risk-averse learning with delayed feedback [17.626195546400247]
We develop two risk-averse learning algorithms that rely on one-point and two-point zeroth-order optimization approaches.
The results suggest that the two-point risk-averse algorithm achieves a smaller regret bound than the one-point algorithm (a sketch of both estimators follows this list).
arXiv Detail & Related papers (2024-09-25T12:32:22Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War [0.0]
We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War.
This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making.
arXiv Detail & Related papers (2023-07-17T20:59:50Z)
- High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z)
- Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
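As referenced in the "Risk-averse learning with delayed feedback" entry above, the sketch below contrasts one-point and two-point zeroth-order gradient estimators. The quadratic objective f, the smoothing radius delta, and the Gaussian directions are illustrative assumptions rather than that paper's exact construction; the sketch only shows why the two-point estimator tends to have much lower variance.

```python
# Illustrative sketch: one-point vs. two-point zeroth-order gradient
# estimators for a black-box objective. The objective f, the smoothing
# radius delta, and the Gaussian directions are assumptions made for
# illustration, not the construction used in the cited paper.
import numpy as np


def f(x):
    # Hypothetical black-box objective: a simple quadratic bowl.
    return 0.5 * np.sum(x ** 2)


def one_point_estimate(x, delta, rng):
    # One-point estimator: a single function query per random direction.
    u = rng.standard_normal(x.shape)
    return (f(x + delta * u) / delta) * u


def two_point_estimate(x, delta, rng):
    # Two-point estimator: a finite difference along the same direction,
    # which cancels the f(x)/delta term that inflates the one-point variance.
    u = rng.standard_normal(x.shape)
    return ((f(x + delta * u) - f(x - delta * u)) / (2.0 * delta)) * u


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.ones(5)
    g1 = np.stack([one_point_estimate(x, 0.1, rng) for _ in range(1000)])
    g2 = np.stack([two_point_estimate(x, 0.1, rng) for _ in range(1000)])
    print("one-point estimator variance:", g1.var(axis=0).mean())
    print("two-point estimator variance:", g2.var(axis=0).mean())
```

Both estimators target the gradient of a Gaussian-smoothed version of f; the two-point version's much smaller empirical variance is consistent with the smaller regret bound reported in the summary above.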
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.