A Case for Validation Buffer in Pessimistic Actor-Critic
- URL: http://arxiv.org/abs/2403.01014v1
- Date: Fri, 1 Mar 2024 22:24:11 GMT
- Title: A Case for Validation Buffer in Pessimistic Actor-Critic
- Authors: Michal Nauman, Mateusz Ostaszewski and Marek Cygan
- Abstract summary: We show that the critic approximation error can be approximated via a recursive fixed-point model similar to that of the Bellman value.
We use this recursive definition to retrieve the conditions under which the pessimistic critic is unbiased, and we propose the Validation Pessimism Learning (VPL) algorithm.
VPL uses a small validation buffer to adjust the level of pessimism throughout agent training, with the pessimism set such that the approximation error of the critic targets is minimized.
- Score: 1.5022206231191775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the issue of error accumulation in critic
networks updated via pessimistic temporal difference objectives. We show that
the critic approximation error can be approximated via a recursive fixed-point
model similar to that of the Bellman value. We use such recursive definition to
retrieve the conditions under which the pessimistic critic is unbiased.
Building on these insights, we propose the Validation Pessimism Learning (VPL)
algorithm. VPL uses a small validation buffer to adjust the levels of pessimism
throughout the agent training, with the pessimism set such that the
approximation error of the critic targets is minimized. We investigate the
proposed approach on a variety of locomotion and manipulation tasks and report
improvements in sample efficiency and performance.
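To make the mechanism concrete, here is a minimal sketch of the validation-buffer idea, assuming a lower-confidence-bound pessimistic target (critic-ensemble mean minus beta times the ensemble spread) and held-out return estimates as the validation reference. The class name, the scalar pessimism coefficient beta, and the gradient step on it are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


class ValidationPessimismSketch:
    """Illustrative sketch only (not the paper's code): tune a scalar
    pessimism coefficient `beta` so that pessimistic critic targets match
    held-out return estimates drawn from a small validation buffer."""

    def __init__(self, beta: float = 1.0, lr: float = 1.0):
        self.beta = beta  # assumed scalar pessimism level
        self.lr = lr      # step size for adjusting beta

    def pessimistic_target(self, q_values: np.ndarray) -> np.ndarray:
        # Lower-confidence-bound style target over a critic ensemble:
        # ensemble mean minus beta times the ensemble standard deviation.
        return q_values.mean(axis=0) - self.beta * q_values.std(axis=0)

    def update_beta(self, q_values: np.ndarray, validation_returns: np.ndarray) -> float:
        """One gradient step on beta that reduces the mean squared error
        between pessimistic targets and validation-buffer return estimates."""
        std = q_values.std(axis=0)
        error = self.pessimistic_target(q_values) - validation_returns
        # d/d(beta) of 0.5 * error^2, using d(target)/d(beta) = -std
        grad = np.mean(error * (-std))
        self.beta = max(self.beta - self.lr * grad, 0.0)  # keep pessimism >= 0
        return float(np.mean(error ** 2))


# Toy usage: an "ensemble" of 2 critics evaluated on 5 validation transitions.
rng = np.random.default_rng(0)
q_ensemble = rng.normal(loc=1.0, scale=0.2, size=(2, 5))   # critic estimates
held_out_returns = rng.normal(loc=0.9, scale=0.1, size=5)  # validation reference
vpl = ValidationPessimismSketch(beta=1.0, lr=1.0)
for _ in range(500):
    mse = vpl.update_beta(q_ensemble, held_out_returns)
print(f"tuned beta: {vpl.beta:.3f}, validation MSE: {mse:.5f}")
```

The design choice sketched here is that pessimism is a single learnable scalar driven only by the error measured on the validation buffer, which mirrors the abstract's description of adjusting pessimism so that the approximation error of the critic targets is minimized.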
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- Explicit Lipschitz Value Estimation Enhances Policy Robustness Against Perturbation [2.2120851074630177]
In robotic control tasks, policies trained by reinforcement learning (RL) in simulation often experience a performance drop when deployed on physical hardware.
We propose that Lipschitz regularization can help condition the approximated value function gradients, leading to improved robustness after training.
arXiv Detail & Related papers (2024-04-22T05:01:29Z)
- Outlier-Insensitive Kalman Filtering: Theory and Applications [26.889182816155838]
We propose a parameter-free algorithm which mitigates the harmful effect of outliers while requiring only a short iterative process of the standard update step of the linear Kalman filter.
arXiv Detail & Related papers (2023-09-18T06:33:28Z)
- Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding [58.73333095047114]
We propose an error-based thresholding (EBT) mechanism for learned ISTA (LISTA).
We show that the proposed EBT mechanism well disentangles the learnable parameters in the shrinkage functions from the reconstruction errors.
arXiv Detail & Related papers (2021-12-21T05:07:54Z)
- Error Controlled Actor-Critic [7.936003142729818]
The approximation error of the value function inevitably causes an overestimation phenomenon and has a negative impact on the convergence of the algorithm.
We propose Error Controlled Actor-Critic, which confines the approximation error of the value function.
arXiv Detail & Related papers (2021-09-06T14:51:20Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.