Efficient Reinforcement Learning from Demonstration Using Local Ensemble
and Reparameterization with Split and Merge of Expert Policies
- URL: http://arxiv.org/abs/2205.11019v1
- Date: Mon, 23 May 2022 03:36:24 GMT
- Title: Efficient Reinforcement Learning from Demonstration Using Local Ensemble
and Reparameterization with Split and Merge of Expert Policies
- Authors: Yu Wang, Fang Liu
- Abstract summary: A policy learned from sub-optimal demonstrations may mislead an agent into incorrect or non-local action decisions.
We propose a new method called Local Ensemble and Reparameterization with Split and Merge of expert policies (LEARN-SAM) to improve efficiency and make better use of the sub-optimal demonstrations.
We demonstrate the superiority of the LEARN-SAM method and its robustness with varying demonstration quality and sparsity in six experiments on complex continuous control problems of low to high dimensions.
- Score: 7.126594773940676
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The current work on reinforcement learning (RL) from demonstrations often
assumes the demonstrations are samples from an optimal policy, an unrealistic
assumption in practice. When demonstrations are generated by sub-optimal
policies or contain sparse state-action pairs, a policy learned from such
demonstrations may mislead an agent into incorrect or non-local action
decisions. We propose a new method called Local Ensemble and Reparameterization
with Split and Merge of expert policies (LEARN-SAM) to improve efficiency and
make better use of the sub-optimal demonstrations. First, LEARN-SAM employs a
new concept, the lambda-function, based on a discrepancy measure between the
current state and the demonstrated states, to "localize" the weights of the expert
policies during learning. Second, LEARN-SAM employs a split-and-merge (SAM)
mechanism by separating the helpful parts in each expert demonstration and
regrouping them into new expert policies to use the demonstrations selectively.
Both the lambda-function and SAM mechanism help boost the learning speed.
Theoretically, we prove the invariance of the reparameterized policy before
and after the SAM mechanism, providing theoretical guarantees for the
convergence of the employed policy gradient method. We demonstrate the
superiority of the LEARN-SAM method and its robustness with varying
demonstration quality and sparsity in six experiments on complex continuous
control problems of low to high dimensions, compared to existing methods on RL
from demonstration.
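The abstract describes the lambda-function only at a high level. A minimal sketch of one plausible reading is given below, in which each expert policy's weight is attenuated by the distance from the current state to that expert's nearest demonstrated state; the Euclidean metric, the Gaussian kernel, and the bandwidth sigma are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def lambda_weights(state, expert_demos, sigma=1.0):
    """Localized weights for a set of expert policies (illustrative sketch).

    state        : current state, shape (d,)
    expert_demos : list of arrays, one per expert, each of shape (n_i, d),
                   holding that expert's demonstrated states
    sigma        : assumed kernel bandwidth controlling how quickly an
                   expert's influence decays with state discrepancy
    """
    weights = []
    for demo_states in expert_demos:
        # Discrepancy = distance from the current state to the closest
        # demonstrated state of this expert.
        discrepancy = np.linalg.norm(demo_states - state, axis=1).min()
        # Experts whose demonstrations lie far from the current state
        # contribute little ("localization").
        weights.append(np.exp(-discrepancy ** 2 / (2.0 * sigma ** 2)))
    weights = np.asarray(weights)
    total = weights.sum()
    return weights / total if total > 0 else weights
```

Under this reading, an expert's advice is trusted only near the states it actually demonstrated, which is the localization effect described in the abstract.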
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, (hyper)policies are learned to convergence only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning [54.40927310957792]
We introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team.
These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements.
We propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate.
arXiv Detail & Related papers (2024-03-13T20:11:20Z)
- Dr.ICL: Demonstration-Retrieved In-context Learning [29.142262267850704]
In-context learning (ICL), teaching a large language model to perform a task with few-shot demonstrations, has emerged as a strong paradigm for using LLMs.
Recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance.
This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations.
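As a hedged illustration of this word-overlap retrieval idea (not Dr.ICL's actual pipeline), the sketch below scores a small pool of demonstrations against the test input with BM25 via the rank_bm25 package and keeps the top-k as few-shot examples; the pool, the sentiment task, and k are placeholders.

```python
from rank_bm25 import BM25Okapi

# Hypothetical pool of (input, label) demonstration pairs.
demo_pool = [
    ("the movie was wonderful", "positive"),
    ("the plot was dull and slow", "negative"),
    ("a charming, well acted film", "positive"),
]

corpus = [x for x, _ in demo_pool]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "the film was dull and poorly acted"
scores = bm25.get_scores(query.split())

# Keep the k demonstrations with the highest word-overlap scores.
k = 2
top_idx = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]

# Build a few-shot prompt from the retrieved demonstrations.
prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in (demo_pool[i] for i in top_idx))
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)
```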
arXiv Detail & Related papers (2023-05-23T14:55:25Z)
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
- DEFENDER: DTW-Based Episode Filtering Using Demonstrations for Enhancing RL Safety [0.0]
We propose a task-agnostic method that leverages small sets of safe and unsafe demonstrations to improve the safety of RL agents during learning.
We evaluate our method on three tasks from OpenAI Gym's Mujoco benchmark and two state-of-the-art RL algorithms.
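The DTW-based filtering rule is only named above; as a rough sketch under assumed details, the snippet below computes a dynamic-time-warping distance between a candidate episode and each demonstration and flags the episode when it lies closer to the unsafe demonstrations than to the safe ones. The Euclidean step cost and this nearest-set decision rule are assumptions, not DEFENDER's published criterion.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between state trajectories a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def looks_unsafe(episode, safe_demos, unsafe_demos):
    # Assumed rule: filter the episode out of training if it resembles the
    # unsafe demonstrations more closely than the safe ones.
    d_safe = min(dtw_distance(episode, d) for d in safe_demos)
    d_unsafe = min(dtw_distance(episode, d) for d in unsafe_demos)
    return d_unsafe < d_safe
```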
arXiv Detail & Related papers (2023-05-08T14:23:27Z)
- Exploiting Symmetry and Heuristic Demonstrations in Off-policy Reinforcement Learning for Robotic Manipulation [1.7901837062462316]
This paper aims to define and incorporate the natural symmetry present in physical robotic environments.
The proposed method is validated via two point-to-point reaching tasks of an industrial arm, with and without an obstacle.
A comparison study between the proposed method and a traditional off-policy reinforcement learning algorithm indicates its advantage in learning performance and potential value for applications.
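The paper's symmetry definition is not reproduced here; one common way to exploit environment symmetry in off-policy RL, sketched below under assumed details, is to mirror each collected transition about a symmetry axis and store the reflected copy in the replay buffer as extra data. The single-axis reflection is an assumption chosen purely for illustration.

```python
import numpy as np

def mirror_transition(state, action, next_state, axis=0):
    """Reflect a transition about one coordinate axis (illustrative only).

    Valid only when the environment's dynamics and reward are symmetric
    about this axis, so the mirrored transition is as real as the original.
    """
    def mirror(x):
        x = np.array(x, dtype=float)
        x[axis] = -x[axis]
        return x

    return mirror(state), mirror(action), mirror(next_state)

# Usage with a hypothetical replay buffer: store both the collected
# transition and its mirror image.
# buffer.add(s, a, s_next)
# buffer.add(*mirror_transition(s, a, s_next))
```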
arXiv Detail & Related papers (2023-04-12T11:38:01Z)
- D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning [48.57484755946714]
D-Shape is a new method for combining imitation learning (IL) and reinforcement learning (RL); it uses ideas from reward shaping and goal-conditioned RL to resolve the conflict between the two objectives.
We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy.
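D-Shape's exact shaping term is not given above; the sketch below shows generic potential-based reward shaping toward a demonstration-derived goal state, which is one way reward shaping and goal-conditioned RL can be combined without altering the optimal policy. The distance-based potential and the choice of goal are assumptions for illustration.

```python
import numpy as np

def shaped_reward(env_reward, state, next_state, goal, gamma=0.99):
    """Potential-based shaping toward a goal state taken from a demonstration.

    F(s, s') = gamma * phi(s') - phi(s) with phi(s) = -||s - goal|| is
    policy-invariant, so the shaped agent still optimizes the original task.
    """
    phi = lambda s: -np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(goal, dtype=float))
    return env_reward + gamma * phi(next_state) - phi(state)
```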
arXiv Detail & Related papers (2022-10-26T02:28:32Z)
- Robust Learning from Observation with Model Misspecification [33.92371002674386]
Imitation learning (IL) is a popular paradigm for training policies in robotic systems.
We propose a robust IL algorithm to learn policies that can effectively transfer to the real environment without fine-tuning.
arXiv Detail & Related papers (2022-02-12T07:04:06Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Learning Robust State Abstractions for Hidden-Parameter Block MDPs [55.31018404591743]
We leverage ideas of common structure from the HiP-MDP setting to enable robust state abstractions inspired by Block MDPs.
We derive instantiations of this new framework for both multi-task reinforcement learning (MTRL) and meta-reinforcement learning (Meta-RL) settings.
arXiv Detail & Related papers (2020-07-14T17:25:27Z)