A Case for Validation Buffer in Pessimistic Actor-Critic
- URL: http://arxiv.org/abs/2403.01014v1
- Date: Fri, 1 Mar 2024 22:24:11 GMT
- Title: A Case for Validation Buffer in Pessimistic Actor-Critic
- Authors: Michal Nauman, Mateusz Ostaszewski and Marek Cygan
- Abstract summary: We show that the critic approximation error can be approximated via a recursive fixed-point model similar to that of the Bellman value, and we use this recursion to derive the conditions under which the pessimistic critic is unbiased.
Building on these insights, we propose the Validation Pessimism Learning (VPL) algorithm.
VPL uses a small validation buffer to adjust the level of pessimism throughout agent training, with pessimism set such that the approximation error of the critic targets is minimized.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the issue of error accumulation in critic
networks updated via pessimistic temporal difference objectives. We show that
the critic approximation error can be approximated via a recursive fixed-point
model similar to that of the Bellman value. We use this recursive definition to
retrieve the conditions under which the pessimistic critic is unbiased.
Building on these insights, we propose the Validation Pessimism Learning (VPL)
algorithm. VPL uses a small validation buffer to adjust the levels of pessimism
throughout the agent training, with the pessimism set such that the
approximation error of the critic targets is minimized. We investigate the
proposed approach on a variety of locomotion and manipulation tasks and report
improvements in sample efficiency and performance.
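A minimal sketch of the validation-buffer idea described in the abstract: a pessimistic target is formed from a critic ensemble, and the pessimism coefficient is tuned by gradient steps so that targets match held-out value estimates. The ensemble form of the target (mean minus coefficient times spread), the gradient rule, and all constants are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pessimistic_target(q_ensemble, beta):
    """Pessimistic critic target: ensemble mean minus beta times ensemble std."""
    return q_ensemble.mean(axis=0) - beta * q_ensemble.std(axis=0)

def update_beta(beta, q_ensemble, q_validation, lr=0.05):
    """One gradient step on beta to shrink target error on the validation set.

    If targets overestimate validation values, beta grows (more pessimism);
    if they underestimate, beta shrinks.
    """
    err = pessimistic_target(q_ensemble, beta) - q_validation
    # d/d(beta) of 0.5*err^2 is -err*std, so gradient descent adds lr*err*std.
    beta = beta + lr * np.mean(err * q_ensemble.std(axis=0))
    return max(beta, 0.0)  # pessimism coefficient kept non-negative

# Toy setup: an ensemble of 5 critics over 32 states whose mean
# overestimates the "true" validation values by +1.
q_true = rng.normal(size=32)
q_ens = q_true + 1.0 + 0.5 * rng.normal(size=(5, 32))

beta = 0.0
for _ in range(200):
    beta = update_beta(beta, q_ens, q_true)
print(round(beta, 2))
```

After training, beta has grown enough that the pessimistic targets largely cancel the ensemble's overestimation bias on the held-out values.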
Related papers
- Explicit Lipschitz Value Estimation Enhances Policy Robustness Against Perturbation [2.2120851074630177]
In robotic control tasks, policies trained by reinforcement learning (RL) in simulation often experience a performance drop when deployed on physical hardware.
We propose that Lipschitz regularization can help condition the approximated value function gradients, leading to improved robustness after training.
arXiv Detail & Related papers (2024-04-22T05:01:29Z)
- Outlier-Insensitive Kalman Filtering: Theory and Applications [29.37450052092755]
We propose a parameter-free algorithm that mitigates the harmful effect of outliers while requiring only a short iterative process on top of the standard update step of the linear Kalman filter.
arXiv Detail & Related papers (2023-09-18T06:33:28Z)
- Value-Distributional Model-Based Reinforcement Learning [63.32053223422317]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages [41.30585319670119]
This paper introduces an effective and practical step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning.
We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights.
We demonstrate significant improvements for median and interquartile mean metrics over PPO, SAC, and TD3 on the MuJoCo continuous control benchmark.
arXiv Detail & Related papers (2023-06-02T11:37:22Z)
- Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding [58.73333095047114]
We propose an error-based thresholding (EBT) mechanism for learned ISTA (LISTA).
We show that the proposed EBT mechanism well disentangles the learnable parameters in the shrinkage functions from the reconstruction errors.
arXiv Detail & Related papers (2021-12-21T05:07:54Z)
- Error Controlled Actor-Critic [7.936003142729818]
Approximation error in the value function inevitably causes overestimation and has a negative impact on the convergence of the algorithms.
We propose Error Controlled Actor-Critic, which confines the approximation error of the value function.
arXiv Detail & Related papers (2021-09-06T14:51:20Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
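Two of the entries above (the outlier-insensitive filter and KOVA) build on the linear Kalman update. As a point of reference, a minimal one-dimensional Kalman step is sketched below, with a crude residual-gated noise inflation standing in for the papers' principled outlier handling; the gating rule and all constants are illustrative assumptions, not details from either paper.

```python
import numpy as np

def kalman_step(x, p, z, q=1e-3, r=0.25):
    """One predict/update step of a 1-D Kalman filter tracking a constant.

    x, p: prior state estimate and variance; z: new measurement;
    q, r: process and measurement noise variances.
    """
    p = p + q  # predict: constant dynamics, variance inflates by process noise
    resid = z - x
    # Illustrative outlier guard (an assumption of this sketch, not the
    # papers' method): inflate measurement noise when the residual is large.
    r_eff = r if abs(resid) < 3.0 * np.sqrt(p + r) else 1000.0 * r
    k = p / (p + r_eff)   # Kalman gain
    x = x + k * resid     # update state toward the measurement
    p = (1 - k) * p       # posterior variance shrinks
    return x, p

rng = np.random.default_rng(1)
x, p = 0.0, 25.0          # uninformative prior so the first update is trusted
true_value = 5.0
for t in range(300):
    z = true_value + 0.5 * rng.normal()
    if t % 50 == 49:      # occasional gross outlier
        z += 50.0
    x, p = kalman_step(x, p, z)
print(round(x, 1))
```

The gated update keeps the gross outliers from dragging the estimate away from the true value, while ordinary measurements are absorbed with the usual gain.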
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.