Error Controlled Actor-Critic
- URL: http://arxiv.org/abs/2109.02517v2
- Date: Tue, 7 Sep 2021 03:08:50 GMT
- Title: Error Controlled Actor-Critic
- Authors: Xingen Gao, Fei Chao, Changle Zhou, Zhen Ge, Chih-Min Lin, Longzhi
Yang, Xiang Chang, and Changjing Shang
- Abstract summary: Approximation error in the value function inevitably causes an overestimation phenomenon and has a negative impact on the convergence of the algorithms.
We propose Error Controlled Actor-Critic, which confines the approximation error of the value function.
- Score: 7.936003142729818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Approximation error in the value function inevitably causes an overestimation
phenomenon and has a negative impact on the convergence of the algorithms. To mitigate the
negative effects of the approximation error, we propose Error Controlled Actor-Critic, which
confines the approximation error of the value function. We present an analysis of how the
approximation error can hinder the optimization process of actor-critic methods. We then
derive an upper bound on the approximation error of the Q-function approximator and find that
the error can be lowered by restricting the KL-divergence between every two consecutive
policies during policy training. Experiments on a range of continuous control tasks
demonstrate that the proposed actor-critic algorithm markedly reduces the approximation error
and significantly outperforms other model-free RL algorithms.
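The key mechanism in the abstract, limiting the KL-divergence between consecutive policies so the critic's approximation error stays confined, can be illustrated with a short actor-update sketch. The network shapes, penalty form, and coefficients below are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch only: a KL-regularized actor update in the spirit of the
# abstract (keep consecutive policies close to confine value-function error).
import copy
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def actor_loss(actor, prev_actor, critic, obs, kl_coef=1.0):
    """Maximize Q(s, a) while penalizing KL(pi_new || pi_old) on the batch."""
    dist_new = actor.dist(obs)
    with torch.no_grad():
        dist_old = prev_actor.dist(obs)
    actions = dist_new.rsample()                      # reparameterized sample
    q_values = critic(torch.cat([obs, actions], -1))  # Q(s, a)
    kl = torch.distributions.kl_divergence(dist_new, dist_old).sum(-1)
    return (-q_values.squeeze(-1) + kl_coef * kl).mean()

# Usage: snapshot the previous policy before each update so the penalty always
# compares two consecutive policies.
obs_dim, act_dim = 3, 1
actor = GaussianActor(obs_dim, act_dim)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
prev_actor = copy.deepcopy(actor)
loss = actor_loss(actor, prev_actor, critic, torch.randn(32, obs_dim))
loss.backward()
```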
Related papers
- Explicit Lipschitz Value Estimation Enhances Policy Robustness Against Perturbation [2.2120851074630177]
In robotic control tasks, policies trained by reinforcement learning (RL) in simulation often experience a performance drop when deployed on physical hardware.
We propose that Lipschitz regularization can help condition the approximated value function gradients, leading to improved robustness after training.
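As a rough illustration of the general idea (not necessarily the paper's method), a Lipschitz-style regularizer can be implemented as a penalty on the input-gradient norm of the value network:

```python
# Hedged sketch: encourage a (locally) Lipschitz value function by penalizing
# how far the input-gradient norm strays from a target, so small state
# perturbations cannot change the value estimate much.
import torch
import torch.nn as nn

def lipschitz_penalty(value_net, states, target_norm=1.0):
    """Penalize ||dV/ds|| deviating from target_norm (a soft Lipschitz-style constraint)."""
    states = states.clone().requires_grad_(True)
    values = value_net(states).sum()
    grads = torch.autograd.grad(values, states, create_graph=True)[0]
    return ((grads.norm(dim=-1) - target_norm) ** 2).mean()

# Example: add the penalty to the usual critic loss with a small coefficient.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
penalty = 0.1 * lipschitz_penalty(value_net, torch.randn(32, 4))
```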
arXiv Detail & Related papers (2024-04-22T05:01:29Z)
- A Case for Validation Buffer in Pessimistic Actor-Critic [1.5022206231191775]
We show that the critic approximation error can be approximated via a fixed-point model, similar to the Bellman equation for the value function.
We propose the Validation Pessimism Learning (VPL) algorithm to retrieve the conditions under which the pessimistic critic is unbiased.
VPL uses a small validation buffer to adjust the levels of pessimism throughout the agent training, with the pessimism set such that the approximation error of the critic targets is minimized.
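A toy sketch of the general idea, adjusting a pessimism coefficient against a small held-out validation buffer, is shown below; the ensemble-based lower-confidence target and the update rule are assumptions for illustration, not the VPL algorithm itself.

```python
# Toy illustration of validation-tuned pessimism (all names and rules assumed).
import numpy as np

def pessimistic_target(q_ensemble, beta):
    """Lower-confidence-bound style target: mean minus beta * std across critics."""
    return q_ensemble.mean(axis=0) - beta * q_ensemble.std(axis=0)

def adjust_pessimism(beta, q_ensemble_val, returns_val, lr=0.01):
    """Nudge beta so the pessimistic target neither over- nor under-shoots the
    returns observed in a small validation buffer."""
    bias = (pessimistic_target(q_ensemble_val, beta) - returns_val).mean()
    return max(0.0, beta + lr * np.sign(bias))  # over-estimating -> more pessimism

# Example: 2 critics evaluated on 5 validation transitions.
q_val = np.array([[1.0, 2.0, 0.5, 1.5, 2.5],
                  [1.2, 1.8, 0.7, 1.4, 2.9]])
returns = np.array([0.9, 1.7, 0.6, 1.2, 2.4])
beta = adjust_pessimism(0.5, q_val, returns)
```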
arXiv Detail & Related papers (2024-03-01T22:24:11Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), which can be applied to either risk-seeking or risk-averse policy optimization.
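For orientation, uncertainty Bellman equations generally take a recursive form in which value uncertainty equals a local term plus discounted propagated uncertainty; the sketch below shows only this generic shape (the paper's specific local term and its posterior-variance convergence result are its contribution).

```latex
% Generic uncertainty Bellman recursion (illustrative shape only):
% u is the value uncertainty, \nu a local uncertainty term, \hat{P} the learned model.
u(s,a) \;=\; \nu(s,a) \;+\; \gamma^{2} \sum_{s',\,a'} \hat{P}(s' \mid s,a)\,\pi(a' \mid s')\, u(s',a')
```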
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Sharper Bounds for Proximal Gradient Algorithms with Errors [6.901159341430919]
We analyse the convergence of the proximal gradient algorithm for convex composite problems in the presence of gradient and proximal computational inaccuracies.
We derive new, tighter deterministic and probabilistic bounds, which we use to verify simulated (MPC) and synthetic (LASSO) optimization problems solved on a reduced-precision machine.
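A minimal sketch of the setting, assuming the standard proximal gradient (ISTA) iteration for LASSO, with an explicit noise term standing in for gradient/proximal inaccuracies; the noise model and scale are illustrative assumptions, not the paper's error model.

```python
# Inexact proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_ista(A, b, lam, steps=200, noise_scale=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1/L with L = ||A||_2^2
    for _ in range(steps):
        grad = A.T @ (A @ x - b) + noise_scale * rng.standard_normal(x.shape)
        x = soft_threshold(x - step * grad, step * lam)   # (inexact) prox step
    return x

A = np.random.default_rng(1).standard_normal((50, 20))
b = A @ np.r_[np.ones(3), np.zeros(17)] + 0.01
x_hat = inexact_ista(A, b, lam=0.1)
```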
arXiv Detail & Related papers (2022-03-04T09:27:08Z)
- Accelerated and instance-optimal policy evaluation with linear function approximation [17.995515643150657]
Existing algorithms fail to match at least one of the two lower bounds for this policy evaluation problem.
We develop an accelerated, variance-reduced fast temporal difference algorithm that simultaneously matches both lower bounds and attains a strong notion of instance-optimality.
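For reference, the textbook TD(0) baseline with linear function approximation that such accelerated, variance-reduced methods build on looks like the plain sketch below (not the paper's algorithm).

```python
# Plain TD(0) policy evaluation with linear features.
import numpy as np

def td0_linear(transitions, feature_fn, dim, gamma=0.99, alpha=0.05):
    """transitions: iterable of (s, r, s_next) tuples sampled under the target policy."""
    w = np.zeros(dim)
    for s, r, s_next in transitions:
        phi, phi_next = feature_fn(s), feature_fn(s_next)
        td_error = r + gamma * w @ phi_next - w @ phi
        w += alpha * td_error * phi
    return w  # estimated value: V(s) ~= w @ feature_fn(s)

# Toy usage with one-hot features over two states.
w = td0_linear([(0, 1.0, 1), (1, 0.0, 0)], lambda s: np.eye(2)[s], dim=2)
```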
arXiv Detail & Related papers (2021-12-24T17:21:04Z)
- Tightening the Approximation Error of Adversarial Risk with Auto Loss Function Search [12.263913626161155]
A common type of evaluation is to approximate the adversarial risk of a model as a robustness indicator.
We propose AutoLoss-AR, the first method that searches for loss functions to tighten this approximation error.
The results demonstrate the effectiveness of the proposed methods.
arXiv Detail & Related papers (2021-11-09T11:47:43Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
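A hedged sketch of the reweighting idea only (not the VA-OPE algorithm): in a fitted Q-iteration regression step with linear features, samples whose Bellman targets have high estimated variance can be down-weighted via weighted least squares.

```python
# Inverse-variance-weighted regression step for fitted Q-iteration (illustrative).
import numpy as np

def weighted_fqi_step(phi, targets, var_estimates, ridge=1e-3):
    """phi: (n, d) features of (s, a); targets: Bellman targets r + gamma*V(s');
    var_estimates: per-sample variance estimates of those targets."""
    w = 1.0 / (var_estimates + 1e-8)                 # inverse-variance weights
    W = np.diag(w)
    d = phi.shape[1]
    theta = np.linalg.solve(phi.T @ W @ phi + ridge * np.eye(d), phi.T @ W @ targets)
    return theta                                     # Q(s, a) ~= phi(s, a) @ theta

theta = weighted_fqi_step(np.random.randn(100, 4), np.random.randn(100), np.ones(100))
```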
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Exploiting Higher Order Smoothness in Derivative-free Optimization and Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
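For intuition, a basic randomized two-point zero-order scheme with a projection step is sketched below; the paper's estimator exploits higher-order smoothness (via kernels) and is more refined than this simple version.

```python
# Two-point zero-order gradient estimate plus projection onto a Euclidean ball.
import numpy as np

def zero_order_projected_step(f, x, radius, h=1e-2, lr=1e-1, rng=None):
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                                            # random unit direction
    grad_est = (f(x + h * u) - f(x - h * u)) / (2 * h) * len(x) * u   # two-point estimate
    x_new = x - lr * grad_est
    norm = np.linalg.norm(x_new)
    return x_new if norm <= radius else x_new * (radius / norm)       # project onto ball

x = np.ones(5)
for _ in range(100):
    x = zero_order_projected_step(lambda z: np.sum(z ** 2), x, radius=2.0)
```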
arXiv Detail & Related papers (2020-06-14T10:42:23Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.