ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation
Learning
- URL: http://arxiv.org/abs/2211.04005v1
- Date: Tue, 8 Nov 2022 04:54:54 GMT
- Title: ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation
Learning
- Authors: Eddy Hudson and Ishan Durugkar and Garrett Warnell and Peter Stone
- Abstract summary: We introduce a modified version of behavioral cloning (BC) that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training.
We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms standard BC by being mode-seeking in nature.
- Score: 48.033516430071494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a dataset of expert agent interactions with an environment of interest,
a viable method to extract an effective agent policy is to estimate the maximum
likelihood policy indicated by this data. This approach is commonly referred to
as behavioral cloning (BC). In this work, we describe a key disadvantage of BC
that arises due to the maximum likelihood objective function; namely that BC is
mean-seeking with respect to the state-conditional expert action distribution
when the learner's policy is represented with a Gaussian. To address this
issue, we introduce a modified version of BC, Adversarial Behavioral Cloning
(ABC), that exhibits mode-seeking behavior by incorporating elements of GAN
(generative adversarial network) training. We evaluate ABC on toy domains and a
domain based on Hopper from the DeepMind Control suite, and show that it
outperforms standard BC by being mode-seeking in nature.
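The mean-seeking behavior described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code or the ABC method itself: the bimodal expert, the function names, and the noise level are all hypothetical choices made for illustration. At a single fixed state, the expert is assumed to act near -1 or +1 with equal probability, and a Gaussian is fit by maximum likelihood, as standard BC with a Gaussian policy would do.

```python
import random
import statistics


def sample_expert_actions(n, seed=0):
    """Hypothetical bimodal expert: acts near -1 or +1 with equal probability."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) + rng.gauss(0.0, 0.05) for _ in range(n)]


def fit_gaussian_mle(actions):
    """Closed-form maximum likelihood Gaussian fit: sample mean and population std."""
    mu = statistics.fmean(actions)
    sigma = statistics.pstdev(actions, mu)
    return mu, sigma


actions = sample_expert_actions(10_000)
mu, sigma = fit_gaussian_mle(actions)

# Fraction of expert actions that lie within 0.5 of the fitted mean.
near_mean = sum(abs(a - mu) < 0.5 for a in actions) / len(actions)

print(f"fitted mean={mu:.3f}, fitted std={sigma:.3f}, expert mass near mean={near_mean:.3f}")
```

Under these assumptions the fitted mean lands between the two modes, in a region where the expert essentially never acts, while the fitted standard deviation stretches to cover both modes. A mode-seeking learner would instead concentrate on one of the two clusters.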
Related papers
- From Imitation to Refinement -- Residual RL for Precise Assembly [19.9786629249219]
Behavior cloning (BC) has enabled impressive capabilities, but imitation is insufficient for learning reliable policies for tasks requiring precise alignment and insertion of objects, such as assembly.
We present ResiP (Residual for Precise Manipulation), which sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with RL.
Evaluation on high-precision manipulation tasks demonstrates strong performance of ResiP over BC methods and direct RL fine-tuning.
arXiv Detail & Related papers (2024-07-23T17:44:54Z)
- ADR-BC: Adversarial Density Weighted Regression Behavior Cloning [29.095342729527733]
Imitation Learning (IL) methods first shape a reward or Q function and then use this shaped function within a reinforcement learning framework to optimize the empirical policy.
We propose ADR-BC, which aims to enhance behavior cloning through augmented density-based action support.
As a one-step behavior cloning framework, ADR-BC avoids the cumulative bias associated with multi-step RL frameworks.
arXiv Detail & Related papers (2024-05-28T06:59:16Z)
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- Learnable Behavior Control: Breaking Atari Human World Records via
Sample-Efficient Behavior Selection [56.87650511573298]
We propose a general framework called Learnable Behavioral Control (LBC) to address the limitation.
Our agents have achieved 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames.
arXiv Detail & Related papers (2023-05-09T08:00:23Z)
- TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from
Mixed Datasets [118.22975463000928]
We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies.
There are two challenges for this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal changes on different states due to the variation of the action coverage induced by different behavior policies.
In this paper, we address both challenges by using adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer based on the TD3 algorithm.
arXiv Detail & Related papers (2022-12-05T09:36:23Z)
- Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and
Stable Online Fine-Tuning [7.462336024223669]
A key challenge is overcoming overestimation bias for actions not present in the data.
One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC).
We demonstrate that by continuing to train a policy offline while reducing the influence of the BC component we can produce refined policies.
arXiv Detail & Related papers (2022-11-21T19:10:27Z)
- Collapse by Conditioning: Training Class-conditional GANs with Limited
Data [109.30895503994687]
We propose a training strategy for conditional GANs (cGANs) that effectively prevents the observed mode-collapse by leveraging unconditional learning.
Our training strategy starts with an unconditional GAN and gradually injects conditional information into the generator and the objective function.
The proposed method for training cGANs with limited data results not only in stable training but also in generating high-quality images.
arXiv Detail & Related papers (2022-01-17T18:59:23Z)
- Object-Aware Regularization for Addressing Causal Confusion in Imitation
Learning [131.1852444489217]
This paper presents Object-aware REgularizatiOn (OREO), a technique that regularizes an imitation policy in an object-aware manner.
Our main idea is to encourage a policy to uniformly attend to all semantic objects, in order to prevent the policy from exploiting nuisance variables strongly correlated with expert actions.
arXiv Detail & Related papers (2021-10-27T01:56:23Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)