On the Reuse Bias in Off-Policy Reinforcement Learning
- URL: http://arxiv.org/abs/2209.07074v3
- Date: Sun, 21 May 2023 12:40:07 GMT
- Title: On the Reuse Bias in Off-Policy Reinforcement Learning
- Authors: Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Dong Yan, Jun Zhu
- Abstract summary: Reuse Bias is the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization.
We show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective.
We present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias.
- Score: 28.29153543457396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Importance sampling (IS) is a popular technique in off-policy evaluation,
which re-weights the return of trajectories in the replay buffer to boost
sample efficiency. However, training with IS can be unstable and previous
attempts to address this issue mainly focus on analyzing the variance of IS. In
this paper, we reveal that the instability is also related to a new notion of
Reuse Bias of IS -- the bias in off-policy evaluation caused by the reuse of
the replay buffer for evaluation and optimization. We theoretically show that
the off-policy evaluation and optimization of the current policy with the data
from the replay buffer result in an overestimation of the objective, which may
cause an erroneous gradient update and degenerate the performance. We further
provide a high-probability upper bound of the Reuse Bias, and show that
controlling one term of the upper bound can control the Reuse Bias by
introducing the concept of stability for off-policy algorithms. Based on these
analyses, we finally present a novel Bias-Regularized Importance Sampling
(BIRIS) framework along with practical algorithms, which can alleviate the
negative impact of the Reuse Bias. Experimental results show that our
BIRIS-based methods can significantly improve the sample efficiency on a series
of continuous control tasks in MuJoCo.
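The overestimation described in the abstract can be demonstrated with a minimal sketch (not taken from the paper): a Gaussian bandit where a behaviour policy fills a small replay buffer, and the current policy is both optimized and evaluated with importance sampling on that same buffer. Because the maximization chases lucky samples, the IS estimate at the selected policy systematically exceeds its true objective. All names and constants here are illustrative assumptions.

```python
import numpy as np

def true_J(theta):
    # Closed-form objective for policy N(theta, 1) on reward -(a-1)^2:
    # J(theta) = E[-(a-1)^2] = -((theta-1)^2 + 1)
    return -((theta - 1.0) ** 2 + 1.0)

def reuse_gap(rng, n=50):
    # Behaviour policy mu = N(0, 1) fills a small replay buffer
    actions = rng.normal(0.0, 1.0, size=n)
    rewards = -(actions - 1.0) ** 2 + rng.normal(0.0, 1.0, size=n)  # noisy returns
    thetas = np.linspace(-1.0, 3.0, 81)
    # log importance ratio of N(theta,1) over N(0,1); normalizing constants cancel
    logw = -0.5 * (actions[None, :] - thetas[:, None]) ** 2 + 0.5 * actions[None, :] ** 2
    est = np.mean(np.exp(logw) * rewards, axis=1)  # per-theta IS estimate of J(theta)
    k = np.argmax(est)                             # optimize against the same buffer
    return est[k] - true_J(thetas[k])              # positive gap = overestimation

rng = np.random.default_rng(0)
gaps = [reuse_gap(rng) for _ in range(200)]
print(f"mean Reuse-Bias gap: {np.mean(gaps):.2f}")  # consistently positive
```

Although each per-policy IS estimate is unbiased, taking the maximum over candidate policies on a shared buffer yields an optimistic value, which is the mechanism the paper names Reuse Bias.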
Related papers
- GIPO: Gaussian Importance Sampling Policy Optimization [12.306486689840774]
GIPO is proposed as a policy optimization objective based on truncated importance sampling. It replaces hard clipping with a log-ratio-based Gaussian trust weight to damp extreme importance ratios. GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes.
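The blurb above contrasts hard clipping with a smooth log-ratio weight. One plausible reading (an illustrative assumption, not the paper's definition) is sketched below: PPO-style clipping saturates the importance ratio at a boundary, while a Gaussian factor in the log ratio damps extreme ratios toward zero. The `sigma` parameter is hypothetical.

```python
import numpy as np

def clipped_weight(ratio, eps=0.2):
    # PPO-style hard clipping of the importance ratio
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def gaussian_trust_weight(ratio, sigma=0.5):
    # Hypothetical log-ratio Gaussian damping: the weight decays smoothly
    # as the ratio drifts away from 1 (log ratio away from 0), rather than
    # being cut off at a hard boundary.
    log_r = np.log(ratio)
    return ratio * np.exp(-0.5 * (log_r / sigma) ** 2)

ratios = np.array([0.5, 1.0, 2.0, 10.0])
print(clipped_weight(ratios))         # extremes saturate at 1 +/- eps
print(gaussian_trust_weight(ratios))  # extremes are damped toward 0
```

Unlike the clip, the Gaussian factor is differentiable everywhere, which is one reason such smooth trust weights are attractive for optimization.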
arXiv Detail & Related papers (2026-03-04T11:34:59Z) - Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z) - A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning [52.97053840476386]
We show that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved.
arXiv Detail & Related papers (2025-11-13T23:06:40Z) - Counterfactual Risk Minimization with IPS-Weighted BPR and Self-Normalized Evaluation in Recommender Systems [3.5507492850515323]
Inverse propensity scoring (IPS) corrects exposure bias in logged feedback, but it often suffers from high variance and instability. We present a simple and effective pipeline that integrates IPS-weighted training with an IPS-weighted Bayesian Personalized Ranking objective. Experiments on synthetic and MovieLens 100K data show that our approach generalizes better under unbiased exposure.
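The variance issue this entry mentions is commonly addressed by self-normalizing the IPS estimator. As a generic sketch (not this paper's pipeline; the data below is synthetic and illustrative), SNIPS divides by the summed importance weights, trading a small bias for a bounded, lower-variance estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic logged bandit feedback from a behaviour policy with known propensities
n = 1000
propensities = rng.uniform(0.05, 1.0, size=n)
rewards = rng.binomial(1, 0.3, size=n).astype(float)
# Target-policy probabilities for the same logged items (illustrative values)
target_probs = rng.uniform(0.0, 1.0, size=n)

w = target_probs / propensities
ips = np.mean(w * rewards)               # unbiased but high-variance
snips = np.sum(w * rewards) / np.sum(w)  # self-normalized: slightly biased, lower variance

print(f"IPS  : {ips:.3f}")
print(f"SNIPS: {snips:.3f}")
```

Because SNIPS is a convex combination of the logged rewards, it always stays within their range, whereas plain IPS can leave it when a few weights explode.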
arXiv Detail & Related papers (2025-08-30T03:14:56Z) - Reparameterization Proximal Policy Optimization [35.59197802340267]
Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. We draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse. We propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG.
arXiv Detail & Related papers (2025-08-08T10:50:55Z) - Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations [7.687215328455751]
Normalizing flow regression (NFR) is a novel offline inference method for approximating posterior distributions.
NFR directly yields a tractable posterior approximation through regression on existing log-density evaluations.
We demonstrate NFR's effectiveness on synthetic benchmarks and real-world applications from neuroscience and biology.
arXiv Detail & Related papers (2025-04-15T18:52:33Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation [13.325600043256552]
We propose a semi-offline evaluation framework, where human users provide annotations of unobserved counterfactual trajectories.
Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of reinforcement learning in high-stakes domains.
arXiv Detail & Related papers (2023-10-26T04:41:19Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments [31.492146288630515]
We introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias.
We empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
arXiv Detail & Related papers (2023-02-23T01:17:21Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Proposal Distribution Calibration for Few-Shot Object Detection [65.19808035019031]
In few-shot object detection (FSOD), the two-step training paradigm is widely adopted to mitigate the severe sample imbalance.
Unfortunately, the extreme data scarcity aggravates the proposal distribution bias, hindering the RoI head from evolving toward novel classes.
We introduce a simple yet effective proposal distribution calibration (PDC) approach to neatly enhance the localization and classification abilities of the RoI head.
arXiv Detail & Related papers (2022-12-15T05:09:11Z) - Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation [12.415463205960156]
In model-free deep reinforcement learning (RL) algorithms, using noisy value estimates to supervise policy evaluation and optimization is detrimental to the sample efficiency.
We provide a systematic analysis of the sources of uncertainty in the noisy supervision that occurs in RL.
We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision.
arXiv Detail & Related papers (2022-01-05T15:46:06Z) - Assessment of Treatment Effect Estimators for Heavy-Tailed Data [70.72363097550483]
A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized control trials (RCTs) is the lack of ground truth (or validation set) to test their performance.
We provide a novel cross-validation-like methodology to address this challenge.
We evaluate our methodology across 709 RCTs implemented in the Amazon supply chain.
arXiv Detail & Related papers (2021-12-14T17:53:01Z) - Variance Reduction based Experience Replay for Policy Optimization [3.0790370651488983]
Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER.
arXiv Detail & Related papers (2021-10-17T19:28:45Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.