Importance Sampling Placement in Off-Policy Temporal-Difference Methods
- URL: http://arxiv.org/abs/2203.10172v1
- Date: Fri, 18 Mar 2022 21:54:09 GMT
- Title: Importance Sampling Placement in Off-Policy Temporal-Difference Methods
- Authors: Eric Graves and Sina Ghiassian
- Abstract summary: We show how off-policy reinforcement learning algorithms correct the entire TD error instead of just the TD target.
Experiments show this subtle modification results in improved performance.
- Score: 3.04585143845864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A central challenge to applying many off-policy reinforcement learning
algorithms to real world problems is the variance introduced by importance
sampling. In off-policy learning, the agent learns about a different policy
than the one being executed. To account for the difference, importance sampling
ratios are often used, but they can increase variance in the algorithms and reduce
the rate of learning. Several variations of importance sampling have been
proposed to reduce variance, with per-decision importance sampling being the
most popular. However, the update rules for most off-policy algorithms in the
literature depart from per-decision importance sampling in a subtle way; they
correct the entire TD error instead of just the TD target. In this work, we
show how this slight change can be interpreted as a control variate for the TD
target, reducing variance and improving performance. Experiments over a wide
range of algorithms show this subtle modification results in improved
performance.
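As a rough illustration of the placement difference discussed in the abstract, here is a sketch in assumed tabular TD(0) notation (not the paper's code or exact formulation): the importance sampling ratio can multiply either the TD target alone or the entire TD error, and the identity in the comments shows why the latter acts as a control variate.
```python
# Illustrative tabular TD(0) sketch (not the paper's code): two placements of
# the importance sampling ratio rho = pi(a|s) / b(a|s).

def td_update_target_only(V, s, r, s_next, rho, alpha, gamma):
    """Per-decision placement: rho corrects only the TD target."""
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])

def td_update_full_error(V, s, r, s_next, rho, alpha, gamma):
    """Common placement: rho corrects the entire TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * delta

# The two increments differ only by a zero-mean term:
#   rho * delta = [rho * (r + gamma * V[s_next]) - V[s]] - (rho - 1) * V[s]
# Since E[rho] = 1 under the behaviour policy, (rho - 1) * V[s] has zero
# expectation and serves as a control variate for the TD target, which is the
# interpretation the abstract refers to.
```
Here V is a dict or array of state-value estimates. Both increments have the same expectation under the behaviour policy; per the abstract, the full-error placement reduces variance and improves performance, consistent with the control-variate view.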
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
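As context for the entry above, a minimal sketch of a trajectory-level importance-sampled policy-gradient estimate follows; the function names and the undiscounted return are illustrative assumptions rather than the paper's method, and the variance of this estimator is what choosing a good behavioural policy aims to reduce.
```python
import numpy as np

def is_pg_estimate(trajectories, logp_pi, logp_b, grad_logp_pi):
    """Trajectory-level importance-sampled policy gradient (illustrative).

    trajectories: list of (states, actions, rewards) gathered under a
    behaviour policy b; logp_pi, logp_b, grad_logp_pi are caller-supplied
    callables (hypothetical names, not from the paper)."""
    grads = []
    for states, actions, rewards in trajectories:
        # Trajectory importance weight: prod_t pi(a_t|s_t) / b(a_t|s_t)
        log_rho = sum(logp_pi(s, a) - logp_b(s, a) for s, a in zip(states, actions))
        rho = np.exp(log_rho)
        ret = sum(rewards)  # undiscounted return, for simplicity
        score = sum(grad_logp_pi(s, a) for s, a in zip(states, actions))
        grads.append(rho * ret * score)
    return np.mean(grads, axis=0)
```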
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
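A minimal sketch of the sample-dropout idea summarized above, assuming a simple symmetric threshold on the ratio (the paper's actual dropout criterion may differ):
```python
import numpy as np

def sample_dropout(rho, low=0.8, high=1.25):
    """Zero out samples whose importance ratio deviates too far from 1.

    rho: array of per-sample ratios pi/b; the thresholds are illustrative."""
    keep = (rho >= low) & (rho <= high)
    return np.where(keep, rho, 0.0), keep

rho = np.array([0.5, 0.9, 1.1, 3.0])
weights, kept = sample_dropout(rho)
# weights -> [0.0, 0.9, 1.1, 0.0]; only ratios close to 1 contribute,
# which bounds the variance of the importance-weighted objective.
```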
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
- Stable Target Field for Reduced Variance Score Estimation in Diffusion Models [5.9115407007859755]
Diffusion models generate samples by reversing a fixed forward diffusion process.
We argue that the source of variance in score estimation lies in the handling of intermediate noise-variance scales.
We propose to remedy the problem by incorporating a reference batch which we use to calculate weighted conditional scores as more stable training targets.
arXiv Detail & Related papers (2023-02-01T18:57:01Z)
- Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning [44.50394347326546]
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
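For context on the per-decision cutting mentioned above, here is a generic off-policy eligibility-trace update with per-decision importance sampling (a standard form, not the paper's trajectory-aware operator); once a ratio of zero wipes the trace, later steps cannot restore the lost credit.
```python
import numpy as np

def per_decision_trace(z, x, rho, gamma, lam):
    """One step of a per-decision off-policy eligibility trace:
    z <- rho * (gamma * lambda * z + x), with x the feature vector of the
    current state. If rho == 0, the entire trace is cut and cannot be
    recovered by later updates, which is the limitation the paper targets."""
    return rho * (gamma * lam * z + x)

z = np.zeros(4)
z = per_decision_trace(z, np.eye(4)[0], rho=0.5, gamma=0.99, lam=0.9)
z = per_decision_trace(z, np.eye(4)[1], rho=0.0, gamma=0.99, lam=0.9)  # trace fully cut
```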
arXiv Detail & Related papers (2023-01-26T18:57:41Z)
- Do We Need to Penalize Variance of Losses for Learning with Label Noise? [91.38888889609002]
We find that the variance should be increased for the problem of learning with noisy labels.
By exploiting the label noise transition matrix, regularizers can be easily designed to reduce the variance of losses.
Empirically, the proposed method of increasing the variance of losses significantly improves the generalization ability of baselines on both synthetic and real-world datasets.
arXiv Detail & Related papers (2022-01-30T06:19:08Z)
- Why Do Self-Supervised Models Transfer? Investigating the Impact of Invariance on Downstream Tasks [79.13089902898848]
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images.
We show that different tasks in computer vision require features to encode different (in)variances.
arXiv Detail & Related papers (2021-11-22T18:16:35Z)
- Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Change Point Detection in Time Series Data using Autoencoders with a Time-Invariant Representation [69.34035527763916]
Change point detection (CPD) aims to locate abrupt property changes in time series data.
Recent CPD methods have demonstrated the potential of deep learning techniques, but often lack the ability to identify more subtle changes in the autocorrelation statistics of the signal.
We employ an autoencoder-based methodology with a novel loss function, through which the autoencoders learn a partially time-invariant representation tailored for CPD.
arXiv Detail & Related papers (2020-08-21T15:03:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.