Greedy Multi-step Off-Policy Reinforcement Learning
- URL: http://arxiv.org/abs/2102.11717v1
- Date: Tue, 23 Feb 2021 14:32:20 GMT
- Title: Greedy Multi-step Off-Policy Reinforcement Learning
- Authors: Yuhui Wang, Pengcheng He, Xiaoyang Tan
- Abstract summary: We propose a novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps.
Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance.
- Score: 14.720255341733413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step off-policy reinforcement learning has achieved great success.
However, existing multi-step methods usually impose a fixed prior on the
bootstrap steps, while the off-policy methods often require additional
correction, suffering from certain undesired effects. In this paper, we propose
a novel bootstrapping method, which greedily takes the maximum value among the
bootstrapping values with varying steps. The new method has two desired
properties: 1) it can flexibly adjust the bootstrap step based on the quality of
the data and the learned value function; 2) it can safely and robustly utilize
data from an arbitrary behavior policy without additional correction, regardless of its
quality or "off-policyness". We analyze the theoretical properties of the
related operator, showing that it is able to converge to the global optimal
value function, at a rate faster than that of the traditional Bellman Optimality
Operator. Furthermore, based on this new operator, we derive new model-free RL
algorithms named Greedy Multi-step Q-Learning and Greedy Multi-step DQN.
Experiments reveal that the proposed methods are reliable, easy to implement,
and achieve state-of-the-art performance on a series of standard benchmark
datasets.
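
The core idea in the abstract, choosing the bootstrap step greedily by taking the maximum over the n-step bootstrapped targets, can be illustrated with a short sketch. The snippet below is a minimal illustrative sketch, not the authors' implementation; the function name `greedy_multistep_target`, the array layout, and the termination handling are assumptions made purely for illustration.

```python
import numpy as np

def greedy_multistep_target(rewards, dones, q_next, gamma=0.99):
    """Illustrative sketch (not the authors' code) of a greedy multi-step target:
    take the maximum over the n-step bootstrapped values,

        max_{n=1..N} [ sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * max_a Q(s_{t+n}, a) ].

    rewards: shape (N,)               rewards r_t, ..., r_{t+N-1} along one trajectory segment
    dones:   shape (N,)               True where the episode terminated at that step
    q_next:  shape (N, num_actions)   Q-values at s_{t+1}, ..., s_{t+N} (e.g. from a target network)
    """
    best_target = -np.inf
    partial_return = 0.0   # sum_{i<n} gamma^i * r_{t+i}
    discount = 1.0         # gamma^n
    for n in range(len(rewards)):
        partial_return += discount * rewards[n]
        discount *= gamma
        # (n+1)-step candidate: discounted reward sum plus bootstrap from Q at step t+n+1
        bootstrap = 0.0 if dones[n] else discount * np.max(q_next[n])
        best_target = max(best_target, partial_return + bootstrap)
        if dones[n]:
            break  # no bootstrapping past episode termination
    return best_target

# Toy usage with made-up numbers: a 3-step segment from a replay buffer.
rewards = np.array([1.0, 0.0, 2.0])
dones = np.array([False, False, False])
q_next = np.array([[0.5, 1.5], [2.0, 0.1], [0.3, 0.2]])
print(greedy_multistep_target(rewards, dones, q_next, gamma=0.9))
```

In a DQN-style variant, a target of this form would presumably replace the usual one-step target in the temporal-difference loss, which matches the abstract's claim that no off-policy correction is required.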
Related papers
- Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning [44.50394347326546]
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
arXiv Detail & Related papers (2023-01-26T18:57:41Z)
- Model-based trajectory stitching for improved behavioural cloning and its applications [7.462336024223669]
Trajectory Stitching (TS) generates new trajectories by 'stitching' pairs of states that were disconnected in the original data.
We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy.
arXiv Detail & Related papers (2022-12-08T14:18:04Z)
- Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z)
- Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm that is capable of improving the accuracy of such weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z)
- Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z)
- Parameter-free Gradient Temporal Difference Learning [3.553493344868414]
We develop gradient-based temporal difference algorithms for reinforcement learning.
Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $\log$ factors.
Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.
arXiv Detail & Related papers (2021-05-10T06:07:05Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)