Improving the Efficiency of Off-Policy Reinforcement Learning by
Accounting for Past Decisions
- URL: http://arxiv.org/abs/2112.12281v1
- Date: Thu, 23 Dec 2021 00:07:28 GMT
- Title: Improving the Efficiency of Off-Policy Reinforcement Learning by
Accounting for Past Decisions
- Authors: Brett Daley and Christopher Amato
- Abstract summary: Classically, off-policy estimation bias is corrected in a per-decision manner.
Off-policy algorithms such as Tree Backup and Retrace rely on this mechanism.
We propose a multistep operator that permits arbitrary past-dependent traces.
- Score: 20.531576904743282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning from multistep returns is crucial for sample-efficient
reinforcement learning, particularly in the experience replay setting now
commonly used with deep neural networks. Classically, off-policy estimation
bias is corrected in a per-decision manner: past temporal-difference errors are
re-weighted by the instantaneous Importance Sampling (IS) ratio (via
eligibility traces) after each action. Many important off-policy algorithms
such as Tree Backup and Retrace rely on this mechanism along with differing
protocols for truncating ("cutting") the ratios ("traces") to counteract the
excessive variance of the IS estimator. Unfortunately, cutting traces on a
per-decision basis is not necessarily efficient; once a trace has been cut
according to local information, the effect cannot be reversed later,
potentially resulting in the premature truncation of estimated returns and
slower learning. In the interest of motivating efficient off-policy algorithms,
we propose a multistep operator that permits arbitrary past-dependent traces.
We prove that our operator is convergent for policy evaluation, and for optimal
control when targeting greedy-in-the-limit policies. Our theorems establish the
first convergence guarantees for many existing algorithms including Truncated
IS, Non-Markov Retrace, and history-dependent TD($\lambda$). Our theoretical
results also provide guidance for the development of new algorithms that
jointly consider multiple past decisions for better credit assignment and
faster learning.
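To make the per-decision mechanism concrete, here is a minimal NumPy sketch (illustrative code, not taken from the paper): the Retrace-style trace multiplies a clipped ratio c_t = lambda * min(1, pi(a_t|s_t)/mu(a_t|s_t)) into a running product, and because each factor is at most 1 the product can only shrink, so a cut is irreversible; a trajectory-aware trace such as Truncated IS (shown in one common form, taken here as an assumption) instead clips the whole product of ratios and can therefore recover. Function names and defaults are illustrative.

```python
import numpy as np

def retrace_traces(pi_probs, mu_probs, lam=1.0):
    """Per-decision traces: c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t)).

    The TD error delta_t is weighted by the running product c_1 * ... * c_t
    (delta_0 is left uncorrected). Since each factor is at most 1, the
    product can only shrink: once a trace is cut, it never recovers.
    """
    ratios = pi_probs / mu_probs
    c = lam * np.minimum(1.0, ratios)
    traces = np.ones_like(c)
    traces[1:] = np.cumprod(c[1:])
    return traces

def truncated_is_traces(pi_probs, mu_probs):
    """Trajectory-aware traces (one common form of Truncated IS):
    clip the *product* of past ratios rather than each factor.

    A small ratio at one step can be offset by larger ratios later,
    so the trace can recover after a local cut.
    """
    ratios = pi_probs / mu_probs
    products = np.ones_like(ratios)
    products[1:] = np.cumprod(ratios[1:])
    return np.minimum(1.0, products)

def off_policy_target(q0, td_errors, traces, gamma=0.99):
    """Corrected multistep target: q0 + sum_t gamma^t * trace_t * delta_t."""
    discounts = gamma ** np.arange(len(td_errors))
    return q0 + np.sum(discounts * traces * td_errors)
```

For example, with pi_probs = np.array([0.9, 0.1, 0.8]) and mu_probs = np.array([0.3, 0.5, 0.2]), retrace_traces returns [1.0, 0.2, 0.2] (the cut at the second step persists), while truncated_is_traces returns [1.0, 0.2, 0.8] (the large later ratio restores the trace).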
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning [44.50394347326546]
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
arXiv Detail & Related papers (2023-01-26T18:57:41Z)
- Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z)
- Actor Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error; a minimal sketch of this sampling scheme appears after this list.
We introduce a novel experience replay sampling framework for actor-critic methods, which also addresses stability issues and the recent findings behind the poor empirical performance of PER.
An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms the competing approaches.
arXiv Detail & Related papers (2022-09-01T15:27:46Z)
- Conservative Distributional Reinforcement Learning with Safety Constraints [22.49025480735792]
Safe exploration can be regarded as a constrained Markov decision problem where the expected long-term cost is constrained.
Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique.
We present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization.
arXiv Detail & Related papers (2022-01-18T19:45:43Z)
- Greedy Multi-step Off-Policy Reinforcement Learning [14.720255341733413]
We propose a novel bootstrapping method that greedily takes the maximum among bootstrap values computed with varying step lengths.
Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-02-23T14:32:20Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains.
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
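For the Actor Prioritized Experience Replay entry above, here is a minimal sketch of the proportional prioritization scheme that PER builds on: transitions are sampled with probability proportional to |delta_i|^alpha and re-weighted by importance-sampling weights to correct the induced bias. The function name and default hyperparameters (alpha, beta, eps) are illustrative assumptions, not taken from that paper.

```python
import numpy as np

def sample_proportional(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-5):
    """Proportional prioritized sampling: P(i) ~ (|delta_i| + eps)^alpha.

    Returns sampled indices and importance-sampling weights that correct
    the bias introduced by the non-uniform sampling distribution.
    """
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()  # normalize so the largest weight is 1
    return idx, weights
```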