Importance Sampling Placement in Off-Policy Temporal-Difference Methods
- URL: http://arxiv.org/abs/2203.10172v1
- Date: Fri, 18 Mar 2022 21:54:09 GMT
- Title: Importance Sampling Placement in Off-Policy Temporal-Difference Methods
- Authors: Eric Graves and Sina Ghiassian
- Abstract summary: We show how off-policy reinforcement learning algorithms correct the entire TD error instead of just the TD target.
Experiments show this subtle modification results in improved performance.
- Score: 3.04585143845864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A central challenge to applying many off-policy reinforcement learning
algorithms to real world problems is the variance introduced by importance
sampling. In off-policy learning, the agent learns about a different policy
than the one being executed. To account for the difference, importance sampling
ratios are often used, but they can increase variance in the algorithms and reduce
the rate of learning. Several variations of importance sampling have been
proposed to reduce variance, with per-decision importance sampling being the
most popular. However, the update rules for most off-policy algorithms in the
literature depart from per-decision importance sampling in a subtle way; they
correct the entire TD error instead of just the TD target. In this work, we
show how this slight change can be interpreted as a control variate for the TD
target, reducing variance and improving performance. Experiments over a wide
range of algorithms show this subtle modification results in improved
performance.
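As a rough illustration of the placement difference discussed in the abstract, here is a sketch in assumed tabular TD(0) notation (not the paper's code or exact formulation): the importance sampling ratio can multiply either the TD target alone or the entire TD error, and the identity in the comments shows why the latter acts as a control variate.
```python
# Illustrative tabular TD(0) sketch (not the paper's code): two placements of
# the importance sampling ratio rho = pi(a|s) / b(a|s).

def td_update_target_only(V, s, r, s_next, rho, alpha, gamma):
    """Per-decision placement: rho corrects only the TD target."""
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])

def td_update_full_error(V, s, r, s_next, rho, alpha, gamma):
    """Common placement: rho corrects the entire TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * delta

# The two increments differ only by a zero-mean term:
#   rho * delta = [rho * (r + gamma * V[s_next]) - V[s]] - (rho - 1) * V[s]
# Since E[rho] = 1 under the behaviour policy, (rho - 1) * V[s] has zero
# expectation and serves as a control variate for the TD target, which is the
# interpretation the abstract refers to.
```
Here V is a dict or array of state-value estimates. Both increments have the same expectation under the behaviour policy; per the abstract, the full-error placement reduces variance and improves performance, consistent with the control-variate view.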
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
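As context for the entry above, a minimal sketch of a trajectory-level importance-sampled policy-gradient estimate follows; the function names and the undiscounted return are illustrative assumptions rather than the paper's method, and the variance of this estimator is what choosing a good behavioural policy aims to reduce.
```python
import numpy as np

def is_pg_estimate(trajectories, logp_pi, logp_b, grad_logp_pi):
    """Trajectory-level importance-sampled policy gradient (illustrative).

    trajectories: list of (states, actions, rewards) gathered under a
    behaviour policy b; logp_pi, logp_b, grad_logp_pi are caller-supplied
    callables (hypothetical names, not from the paper)."""
    grads = []
    for states, actions, rewards in trajectories:
        # Trajectory importance weight: prod_t pi(a_t|s_t) / b(a_t|s_t)
        log_rho = sum(logp_pi(s, a) - logp_b(s, a) for s, a in zip(states, actions))
        rho = np.exp(log_rho)
        ret = sum(rewards)  # undiscounted return, for simplicity
        score = sum(grad_logp_pi(s, a) for s, a in zip(states, actions))
        grads.append(rho * ret * score)
    return np.mean(grads, axis=0)
```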
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
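A minimal sketch of the sample-dropout idea summarized above, assuming a simple symmetric threshold on the ratio (the paper's actual dropout criterion may differ):
```python
import numpy as np

def sample_dropout(rho, low=0.8, high=1.25):
    """Zero out samples whose importance ratio deviates too far from 1.

    rho: array of per-sample ratios pi/b; the thresholds are illustrative."""
    keep = (rho >= low) & (rho <= high)
    return np.where(keep, rho, 0.0), keep

rho = np.array([0.5, 0.9, 1.1, 3.0])
weights, kept = sample_dropout(rho)
# weights -> [0.0, 0.9, 1.1, 0.0]; only ratios close to 1 contribute,
# which bounds the variance of the importance-weighted objective.
```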
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
- Stable Target Field for Reduced Variance Score Estimation in Diffusion Models [5.9115407007859755]
Diffusion models generate samples by reversing a fixed forward diffusion process.
We argue that the source of variance in score estimation lies in the handling of intermediate noise-variance scales.
We propose to remedy the problem by incorporating a reference batch which we use to calculate weighted conditional scores as more stable training targets.
arXiv Detail & Related papers (2023-02-01T18:57:01Z)
- Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning [44.50394347326546]
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
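For context on the per-decision cutting mentioned above, here is a generic off-policy eligibility-trace update with per-decision importance sampling (a standard form, not the paper's trajectory-aware operator); once a ratio of zero wipes the trace, later steps cannot restore the lost credit.
```python
import numpy as np

def per_decision_trace(z, x, rho, gamma, lam):
    """One step of a per-decision off-policy eligibility trace:
    z <- rho * (gamma * lambda * z + x), with x the feature vector of the
    current state. If rho == 0, the entire trace is cut and cannot be
    recovered by later updates, which is the limitation the paper targets."""
    return rho * (gamma * lam * z + x)

z = np.zeros(4)
z = per_decision_trace(z, np.eye(4)[0], rho=0.5, gamma=0.99, lam=0.9)
z = per_decision_trace(z, np.eye(4)[1], rho=0.0, gamma=0.99, lam=0.9)  # trace fully cut
```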
arXiv Detail & Related papers (2023-01-26T18:57:41Z)
- Do We Need to Penalize Variance of Losses for Learning with Label Noise? [91.38888889609002]
We find that the variance should be increased for the problem of learning with noisy labels.
By exploiting the label noise transition matrix, regularizers can be easily designed to reduce the variance of losses.
Empirically, the proposed method of increasing the variance of losses significantly improves the generalization ability of baselines on both synthetic and real-world datasets.
arXiv Detail & Related papers (2022-01-30T06:19:08Z)
- Why Do Self-Supervised Models Transfer? Investigating the Impact of Invariance on Downstream Tasks [79.13089902898848]
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images.
We show that different tasks in computer vision require features to encode different (in)variances.
arXiv Detail & Related papers (2021-11-22T18:16:35Z)
- Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Change Point Detection in Time Series Data using Autoencoders with a Time-Invariant Representation [69.34035527763916]
Change point detection (CPD) aims to locate abrupt property changes in time series data.
Recent CPD methods have demonstrated the potential of deep learning techniques, but often lack the ability to identify more subtle changes in the autocorrelation statistics of the signal.
We employ an autoencoder-based methodology with a novel loss function, through which the autoencoders learn a partially time-invariant representation tailored for CPD.
arXiv Detail & Related papers (2020-08-21T15:03:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.