Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement
Learning
- URL: http://arxiv.org/abs/2301.11321v2
- Date: Wed, 31 May 2023 05:13:15 GMT
- Title: Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement
Learning
- Authors: Brett Daley, Martha White, Christopher Amato, Marlos C. Machado
- Abstract summary: Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
- Score: 44.50394347326546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning from multistep returns is crucial for sample-efficient
reinforcement learning, but counteracting off-policy bias without exacerbating
variance is challenging. Classically, off-policy bias is corrected in a
per-decision manner: past temporal-difference errors are re-weighted by the
instantaneous Importance Sampling (IS) ratio after each action via eligibility
traces. Many off-policy algorithms rely on this mechanism, along with differing
protocols for cutting the IS ratios to combat the variance of the IS estimator.
Unfortunately, once a trace has been fully cut, the effect cannot be reversed.
This has led to the development of credit-assignment strategies that account
for multiple past experiences at a time. These trajectory-aware methods have
not been extensively analyzed, and their theoretical justification remains
uncertain. In this paper, we propose a multistep operator that can express both
per-decision and trajectory-aware methods. We prove convergence conditions for
our operator in the tabular setting, establishing the first guarantees for
several existing methods as well as many new ones. Finally, we introduce
Recency-Bounded Importance Sampling (RBIS), which leverages trajectory
awareness to perform robustly across $\lambda$-values in an off-policy control
task.
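To make the mechanism described in the abstract concrete, the hedged Python sketch below contrasts cumulative trace coefficients under a per-decision correction (a Retrace-style truncation, $\lambda \min(1, \rho_t)$ at every step) with a hypothetical trajectory-aware rule whose coefficient depends on the product of all past IS ratios, capped by $\lambda^t$ for recency. The function names and the capping rule are illustrative assumptions for this listing, not the paper's multistep operator or its exact RBIS definition.

```python
import numpy as np

def per_decision_coeffs(rhos, gamma, lam):
    """Cumulative trace coefficients when off-policy bias is corrected
    per decision: each step multiplies the running trace by
    gamma * lambda * min(1, rho) (Retrace-style truncation). Once a step
    shrinks or cuts the trace, no later step can restore it."""
    e, coeffs = 1.0, []
    for rho in rhos:
        e *= gamma * lam * min(1.0, rho)
        coeffs.append(e)
    return coeffs

def trajectory_aware_coeffs(rhos, gamma, lam):
    """Cumulative trace coefficients from a hypothetical trajectory-aware
    rule: the coefficient at step t depends on the product of ALL past IS
    ratios, capped by lambda**t, so one small ratio followed by large ones
    does not permanently collapse the trace. The cap is only an
    illustration of recency-bounded behaviour, not the paper's exact
    RBIS definition."""
    coeffs = []
    for t in range(1, len(rhos) + 1):
        full_correction = float(np.prod(rhos[:t]))  # product of IS ratios so far
        coeffs.append(gamma**t * min(lam**t, full_correction))
    return coeffs

# A small IS ratio at step 0 followed by a large one at step 1:
# the per-decision trace stays collapsed, the trajectory-aware one recovers.
rhos = [0.1, 5.0, 1.0]
print(per_decision_coeffs(rhos, gamma=0.99, lam=0.9))
print(trajectory_aware_coeffs(rhos, gamma=0.99, lam=0.9))
```

With ratios such as [0.1, 5.0, 1.0], the per-decision coefficients never recover from the early truncation, while the trajectory-aware ones do; this is the intuition behind leveraging trajectory awareness.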
Related papers
- Batch Ensemble for Variance Dependent Regret in Stochastic Bandits [41.95653110232677]
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL).
Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that achieves near-optimal regret for Multi-Armed Bandits (MAB).
Our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses.
arXiv Detail & Related papers (2024-09-13T06:40:56Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z)
- Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction [12.94372063457462]
Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms.
It suffers from a critical drawback, however: it relies on learning from a single sample of the joint action at a given state.
We propose an enhancement tool that accommodates any actor-critic MARL method.
arXiv Detail & Related papers (2022-09-02T13:44:00Z)
- Actor Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error.
We introduce a novel experience replay sampling framework for actor-critic methods that also addresses stability issues and recent findings on the poor empirical performance of PER.
An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms the competing approaches.
arXiv Detail & Related papers (2022-09-01T15:27:46Z)
- Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions [20.531576904743282]
Off-policy estimation bias is corrected in a per-decision manner.
Off-policy algorithms such as Tree Backup and Retrace rely on this mechanism.
We propose a multistep operator that permits arbitrary past-dependent traces.
arXiv Detail & Related papers (2021-12-23T00:07:28Z)
- Greedy Multi-step Off-Policy Reinforcement Learning [14.720255341733413]
We propose a novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps.
Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-02-23T14:32:20Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces divergence from the behavior policy and a value constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We derive the minimum-variance OPE estimator for multiple loggers for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.