A Case for Validation Buffer in Pessimistic Actor-Critic
- URL: http://arxiv.org/abs/2403.01014v1
- Date: Fri, 1 Mar 2024 22:24:11 GMT
- Title: A Case for Validation Buffer in Pessimistic Actor-Critic
- Authors: Michal Nauman, Mateusz Ostaszewski and Marek Cygan
- Abstract summary: We show that the critic approximation error can be approximated via a recursive fixed-point model similar to that of the Bellman value.
We use this recursive definition to retrieve the conditions under which the pessimistic critic is unbiased, and we propose the Validation Pessimism Learning (VPL) algorithm.
VPL uses a small validation buffer to adjust the level of pessimism throughout agent training, with the pessimism set such that the approximation error of the critic targets is minimized.
- Score: 1.5022206231191775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the issue of error accumulation in critic
networks updated via pessimistic temporal difference objectives. We show that
the critic approximation error can be approximated via a recursive fixed-point
model similar to that of the Bellman value. We use such recursive definition to
retrieve the conditions under which the pessimistic critic is unbiased.
Building on these insights, we propose the Validation Pessimism Learning (VPL)
algorithm. VPL uses a small validation buffer to adjust the levels of pessimism
throughout the agent training, with the pessimism set such that the
approximation error of the critic targets is minimized. We investigate the
proposed approach on a variety of locomotion and manipulation tasks and report
improvements in sample efficiency and performance.
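To make the mechanism concrete, here is a minimal sketch of the validation-buffer idea, assuming a lower-confidence-bound pessimistic target (critic-ensemble mean minus beta times the ensemble spread) and held-out return estimates as the validation reference. The class name, the scalar pessimism coefficient beta, and the gradient step on it are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


class ValidationPessimismSketch:
    """Illustrative sketch only (not the paper's code): tune a scalar
    pessimism coefficient `beta` so that pessimistic critic targets match
    held-out return estimates drawn from a small validation buffer."""

    def __init__(self, beta: float = 1.0, lr: float = 1.0):
        self.beta = beta  # assumed scalar pessimism level
        self.lr = lr      # step size for adjusting beta

    def pessimistic_target(self, q_values: np.ndarray) -> np.ndarray:
        # Lower-confidence-bound style target over a critic ensemble:
        # ensemble mean minus beta times the ensemble standard deviation.
        return q_values.mean(axis=0) - self.beta * q_values.std(axis=0)

    def update_beta(self, q_values: np.ndarray, validation_returns: np.ndarray) -> float:
        """One gradient step on beta that reduces the mean squared error
        between pessimistic targets and validation-buffer return estimates."""
        std = q_values.std(axis=0)
        error = self.pessimistic_target(q_values) - validation_returns
        # d/d(beta) of 0.5 * error^2, using d(target)/d(beta) = -std
        grad = np.mean(error * (-std))
        self.beta = max(self.beta - self.lr * grad, 0.0)  # keep pessimism >= 0
        return float(np.mean(error ** 2))


# Toy usage: an "ensemble" of 2 critics evaluated on 5 validation transitions.
rng = np.random.default_rng(0)
q_ensemble = rng.normal(loc=1.0, scale=0.2, size=(2, 5))   # critic estimates
held_out_returns = rng.normal(loc=0.9, scale=0.1, size=5)  # validation reference
vpl = ValidationPessimismSketch(beta=1.0, lr=1.0)
for _ in range(500):
    mse = vpl.update_beta(q_ensemble, held_out_returns)
print(f"tuned beta: {vpl.beta:.3f}, validation MSE: {mse:.5f}")
```

The design choice sketched here is that pessimism is a single learnable scalar driven only by the error measured on the validation buffer, which mirrors the abstract's description of adjusting pessimism so that the approximation error of the critic targets is minimized.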
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- Explicit Lipschitz Value Estimation Enhances Policy Robustness Against Perturbation [2.2120851074630177]
In robotic control tasks, policies trained by reinforcement learning (RL) in simulation often experience a performance drop when deployed on physical hardware.
We propose that Lipschitz regularization can help condition the approximated value function gradients, leading to improved robustness after training.
arXiv Detail & Related papers (2024-04-22T05:01:29Z)
- Outlier-Insensitive Kalman Filtering: Theory and Applications [26.889182816155838]
We propose a parameter-free algorithm which mitigates the harmful effect of outliers while requiring only a short iterative process of the standard update step of the linear Kalman filter.
arXiv Detail & Related papers (2023-09-18T06:33:28Z)
- Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding [58.73333095047114]
We propose an error-based thresholding (EBT) mechanism for learned ISTA (LISTA).
We show that the proposed EBT mechanism well disentangles the learnable parameters in the shrinkage functions from the reconstruction errors.
arXiv Detail & Related papers (2021-12-21T05:07:54Z)
- Error Controlled Actor-Critic [7.936003142729818]
The approximation error of the value function inevitably causes an overestimation phenomenon and has a negative impact on the convergence of the algorithm.
We propose Error Controlled Actor-Critic, which confines the approximation error of the value function.
arXiv Detail & Related papers (2021-09-06T14:51:20Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.