Data-efficient Hindsight Off-policy Option Learning
- URL: http://arxiv.org/abs/2007.15588v2
- Date: Tue, 15 Jun 2021 15:55:50 GMT
- Title: Data-efficient Hindsight Off-policy Option Learning
- Authors: Markus Wulfmeier, Dushyant Rao, Roland Hafner, Thomas Lampe, Abbas
Abdolmaleki, Tim Hertweck, Michael Neunert, Dhruva Tirumala, Noah Siegel,
Nicolas Heess, Martin Riedmiller
- Abstract summary: We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm.
It robustly trains all policy components off-policy and end-to-end.
The approach outperforms existing option learning methods on common benchmarks.
- Score: 20.42535406663446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Hindsight Off-policy Options (HO2), a data-efficient option
learning algorithm. Given any trajectory, HO2 infers likely option choices and
backpropagates through the dynamic programming inference procedure to robustly
train all policy components off-policy and end-to-end. The approach outperforms
existing option learning methods on common benchmarks. To better understand the
option framework and disentangle benefits from both temporal and action
abstraction, we evaluate ablations with flat policies and mixture policies with
comparable optimization. The results highlight the importance of both types of
abstraction as well as off-policy training and trust-region constraints,
particularly in challenging, simulated 3D robot manipulation tasks from raw
pixel inputs. Finally, we intuitively adapt the inference step to investigate
the effect of increased temporal abstraction on training with pre-trained
options and from scratch.
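To make the inference step concrete, below is a minimal, illustrative sketch (in PyTorch) of the kind of differentiable dynamic-programming marginalisation the abstract describes: option choices along a trajectory are treated as latent variables, their log-marginal likelihood is computed with an HMM-style forward recursion, and gradients flow back through that recursion to all policy components. All names, tensor shapes, and the combined per-step "transition" term (folding termination and high-level re-selection into one matrix) are assumptions made for illustration, not the authors' implementation.

    import torch

    def option_log_marginal(log_pi_a, log_trans, log_init):
        """Log-marginal likelihood of observed actions with option choices marginalised out.

        log_pi_a : [T, K]      log-prob of the observed action a_t under each of K options
        log_trans: [T-1, K, K] log-prob of moving from option i at step t to option j at t+1
        log_init : [K]         log-prob of the initial option choice
        """
        T, K = log_pi_a.shape
        # alpha[k] = log p(a_1..a_t, o_t = k); standard forward recursion.
        alpha = log_init + log_pi_a[0]
        for t in range(1, T):
            # Marginalise over the previous option, then add the current action likelihood.
            alpha = torch.logsumexp(alpha.unsqueeze(1) + log_trans[t - 1], dim=0) + log_pi_a[t]
        return torch.logsumexp(alpha, dim=0)  # log p(a_1..a_T)

    # Toy usage with random stand-ins for policy outputs over a trajectory of length T and K options.
    T, K = 10, 4
    params = torch.randn(T, K, requires_grad=True)           # stands in for low-level policy outputs
    log_pi_a = torch.log_softmax(params, dim=-1)
    log_trans = torch.log_softmax(torch.randn(T - 1, K, K), dim=-1)
    log_init = torch.log_softmax(torch.zeros(K), dim=-1)

    loss = -option_log_marginal(log_pi_a, log_trans, log_init)
    loss.backward()  # gradients propagate through the dynamic-programming inference

In the paper's setting, the per-step terms would come from the actual high-level (option selection and termination) and low-level (intra-option) policies evaluated on off-policy trajectories, and the resulting objective is optimised under trust-region constraints; the sketch only illustrates the differentiable marginalisation that hindsight option inference relies on.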
Related papers
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks using the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Multi-Task Option Learning and Discovery for Stochastic Path Planning [27.384742641275228]
This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon path planning problems.
Our approach computes useful options with policies as well as high-level paths that compose the discovered options.
We show that this approach yields strong guarantees of executability and solvability.
arXiv Detail & Related papers (2022-09-30T19:57:52Z)
- Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm that avoids overfitting of such estimators through a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z)
- Model Selection in Batch Policy Optimization [88.52887493684078]
We study the problem of model selection in batch policy optimization.
We identify three sources of error that any model selection algorithm should optimally trade off in order to be competitive.
arXiv Detail & Related papers (2021-12-23T02:31:50Z)
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- SOAC: The Soft Option Actor-Critic Architecture [25.198302636265286]
Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option selection policy.
Existing methods typically suffer from two major challenges: ineffective exploration and unstable updates.
We present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges.
arXiv Detail & Related papers (2020-06-25T13:06:59Z)
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.