Improving the Efficiency of Off-Policy Reinforcement Learning by
Accounting for Past Decisions
- URL: http://arxiv.org/abs/2112.12281v1
- Date: Thu, 23 Dec 2021 00:07:28 GMT
- Title: Improving the Efficiency of Off-Policy Reinforcement Learning by
Accounting for Past Decisions
- Authors: Brett Daley and Christopher Amato
- Abstract summary: Classically, off-policy estimation bias is corrected in a per-decision manner.
Off-policy algorithms such as Tree Backup and Retrace rely on this mechanism.
We propose a multistep operator that permits arbitrary past-dependent traces.
- Score: 20.531576904743282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning from multistep returns is crucial for sample-efficient
reinforcement learning, particularly in the experience replay setting now
commonly used with deep neural networks. Classically, off-policy estimation
bias is corrected in a per-decision manner: past temporal-difference errors are
re-weighted by the instantaneous Importance Sampling (IS) ratio (via
eligibility traces) after each action. Many important off-policy algorithms
such as Tree Backup and Retrace rely on this mechanism along with differing
protocols for truncating ("cutting") the ratios ("traces") to counteract the
excessive variance of the IS estimator. Unfortunately, cutting traces on a
per-decision basis is not necessarily efficient; once a trace has been cut
according to local information, the effect cannot be reversed later,
potentially resulting in the premature truncation of estimated returns and
slower learning. In the interest of motivating efficient off-policy algorithms,
we propose a multistep operator that permits arbitrary past-dependent traces.
We prove that our operator is convergent for policy evaluation, and for optimal
control when targeting greedy-in-the-limit policies. Our theorems establish the
first convergence guarantees for many existing algorithms including Truncated
IS, Non-Markov Retrace, and history-dependent TD($\lambda$). Our theoretical
results also provide guidance for the development of new algorithms that
jointly consider multiple past decisions for better credit assignment and
faster learning.
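To make the per-decision mechanism concrete, here is a minimal NumPy sketch (illustrative code, not taken from the paper): the Retrace-style trace multiplies a clipped ratio c_t = lambda * min(1, pi(a_t|s_t)/mu(a_t|s_t)) into a running product, and because each factor is at most 1 the product can only shrink, so a cut is irreversible; a trajectory-aware trace such as Truncated IS (shown in one common form, taken here as an assumption) instead clips the whole product of ratios and can therefore recover. Function names and defaults are illustrative.

```python
import numpy as np

def retrace_traces(pi_probs, mu_probs, lam=1.0):
    """Per-decision traces: c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t)).

    The TD error delta_t is weighted by the running product c_1 * ... * c_t
    (delta_0 is left uncorrected). Since each factor is at most 1, the
    product can only shrink: once a trace is cut, it never recovers.
    """
    ratios = pi_probs / mu_probs
    c = lam * np.minimum(1.0, ratios)
    traces = np.ones_like(c)
    traces[1:] = np.cumprod(c[1:])
    return traces

def truncated_is_traces(pi_probs, mu_probs):
    """Trajectory-aware traces (one common form of Truncated IS):
    clip the *product* of past ratios rather than each factor.

    A small ratio at one step can be offset by larger ratios later,
    so the trace can recover after a local cut.
    """
    ratios = pi_probs / mu_probs
    products = np.ones_like(ratios)
    products[1:] = np.cumprod(ratios[1:])
    return np.minimum(1.0, products)

def off_policy_target(q0, td_errors, traces, gamma=0.99):
    """Corrected multistep target: q0 + sum_t gamma^t * trace_t * delta_t."""
    discounts = gamma ** np.arange(len(td_errors))
    return q0 + np.sum(discounts * traces * td_errors)
```

For example, with pi_probs = np.array([0.9, 0.1, 0.8]) and mu_probs = np.array([0.3, 0.5, 0.2]), retrace_traces returns [1.0, 0.2, 0.2] (the cut at the second step persists), while truncated_is_traces returns [1.0, 0.2, 0.8] (the large later ratio restores the trace).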
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning [44.50394347326546]
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
arXiv Detail & Related papers (2023-01-26T18:57:41Z)
- Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z)
- Actor Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error; a minimal sketch of this sampling scheme appears after this list.
We introduce a novel experience replay sampling framework for actor-critic methods, which also addresses stability issues and the recent findings behind the poor empirical performance of PER.
An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms the competing approaches.
arXiv Detail & Related papers (2022-09-01T15:27:46Z)
- Conservative Distributional Reinforcement Learning with Safety Constraints [22.49025480735792]
Safe exploration can be regarded as a constrained Markov decision problem where the expected long-term cost is constrained.
Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique.
We present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization.
arXiv Detail & Related papers (2022-01-18T19:45:43Z)
- Greedy Multi-step Off-Policy Reinforcement Learning [14.720255341733413]
We propose a novel bootstrapping method that greedily takes the maximum among bootstrap values computed with varying step lengths.
Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-02-23T14:32:20Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains.
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
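For the Actor Prioritized Experience Replay entry above, here is a minimal sketch of the proportional prioritization scheme that PER builds on: transitions are sampled with probability proportional to |delta_i|^alpha and re-weighted by importance-sampling weights to correct the induced bias. The function name and default hyperparameters (alpha, beta, eps) are illustrative assumptions, not taken from that paper.

```python
import numpy as np

def sample_proportional(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-5):
    """Proportional prioritized sampling: P(i) ~ (|delta_i| + eps)^alpha.

    Returns sampled indices and importance-sampling weights that correct
    the bias introduced by the non-uniform sampling distribution.
    """
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()  # normalize so the largest weight is 1
    return idx, weights
```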