ADR-BC: Adversarial Density Weighted Regression Behavior Cloning
- URL: http://arxiv.org/abs/2405.20351v1
- Date: Tue, 28 May 2024 06:59:16 GMT
- Title: ADR-BC: Adversarial Density Weighted Regression Behavior Cloning
- Authors: Ziqi Zhang, Zifeng Zhuang, Donglin Wang, Jingzehua Xu, Miao Liu, Shuai Zhang
- Abstract summary: Imitation Learning (IL) methods first shape a reward or Q function and then use this shaped function within a reinforcement learning framework to optimize the empirical policy.
We propose ADR-BC, which aims to enhance behavior cloning through augmented density-based action support.
As a one-step behavior cloning framework, ADR-BC avoids the cumulative bias associated with multi-step RL frameworks.
- Score: 29.095342729527733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Typically, traditional Imitation Learning (IL) methods first shape a reward or Q function and then use this shaped function within a reinforcement learning (RL) framework to optimize the empirical policy. However, if the shaped reward/Q function does not adequately represent the ground-truth reward/Q function, updating the policy within a multi-step RL framework may result in cumulative bias, further impacting policy learning. Although utilizing behavior cloning (BC) to learn a policy by directly mimicking a few demonstrations in a single-step updating manner can avoid cumulative bias, BC tends to greedily imitate demonstrated actions, limiting its capacity to generalize to unseen state-action pairs. To address these challenges, we propose ADR-BC, which aims to enhance behavior cloning through augmented density-based action support, optimizing the policy with this augmented support. Specifically, the objective of ADR-BC has a clear physical meaning: matching the expert distribution while diverging from the sub-optimal distribution. ADR-BC can therefore achieve more robust expert distribution matching. Meanwhile, as a one-step behavior cloning framework, ADR-BC avoids the cumulative bias associated with multi-step RL frameworks. To validate the performance of ADR-BC, we conduct extensive experiments. Specifically, ADR-BC shows a 10.5% improvement over the previous state-of-the-art (SOTA) generalized IL baseline, CEIL, across all tasks in the Gym-MuJoCo domain, and an 89.5% improvement over Implicit Q-Learning (IQL) using real rewards across all tasks in the Adroit and Kitchen domains. We also conduct extensive ablations to further demonstrate the effectiveness of ADR-BC.
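The abstract does not give the exact objective, but its stated intuition (a one-step BC update that pulls the policy's likelihood up on expert data while pushing it down on sub-optimal data) can be illustrated with a minimal PyTorch sketch. The `GaussianPolicy` class and the `beta` trade-off coefficient are hypothetical stand-ins, not the paper's actual density-weighted formulation:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Tiny diagonal-Gaussian policy, included only to make the sketch runnable."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, s, a):
        dist = torch.distributions.Normal(self.mu(s), self.log_std.exp())
        return dist.log_prob(a).sum(-1)

def adr_bc_style_loss(policy, expert_batch, subopt_batch, beta=1.0):
    """One-step weighted-BC loss in the spirit of ADR-BC: match the expert
    distribution while diverging from the sub-optimal one. `beta` is a
    hypothetical trade-off weight; the paper's density-based weighting is
    more involved."""
    s_e, a_e = expert_batch
    s_b, a_b = subopt_batch
    match = -policy.log_prob(s_e, a_e).mean()    # pull likelihood up on expert pairs
    diverge = policy.log_prob(s_b, a_b).mean()   # push it down on sub-optimal pairs
    # Single supervised step: no bootstrapping, hence no multi-step cumulative bias.
    return match + beta * diverge
```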
Related papers
- Offline Behavior Distillation [57.6900189406964]
Massive amounts of reinforcement learning (RL) data are typically collected to train policies offline without the need for environment interaction.
We formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data.
We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy.
arXiv Detail & Related papers (2024-10-30T06:28:09Z)
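For illustration only, here is a minimal sketch of the decision-difference idea behind DBC and PBC, assuming deterministic policies that map states to actions; the paper's exact objectives may differ:

```python
import torch

@torch.no_grad()
def decision_difference(pi_distilled, pi_reference, states):
    """Mean squared action gap between a policy trained on distilled data
    and a reference: a policy trained on the full offline data (DBC-style)
    or a near-expert policy (PBC-style)."""
    gap = pi_distilled(states) - pi_reference(states)
    return (gap ** 2).sum(-1).mean()
```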
- From Imitation to Refinement -- Residual RL for Precise Assembly [19.9786629249219]
Recent advances in behavior cloning (BC), such as action chunking and diffusion policies, have led to impressive progress.
Our key insight is that chunked BC policies function as trajectory planners, enabling long-horizon tasks.
We present ResiP (Residual for Precise Manipulation), which sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with RL, as sketched below.
arXiv Detail & Related papers (2024-07-23T17:44:54Z)
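A minimal sketch of the decomposition described above: a frozen, chunked BC planner plus a small RL-trained residual correction. The interfaces (`chunked_bc`, `residual`) and the `action_scale` bound are hypothetical stand-ins, not the paper's actual API:

```python
import torch
import torch.nn as nn

class ResidualActor(nn.Module):
    """Frozen chunked-BC planner plus a closed-loop residual head."""
    def __init__(self, chunked_bc, residual, action_scale=0.1):
        super().__init__()
        self.chunked_bc = chunked_bc.eval()      # frozen trajectory planner
        for p in self.chunked_bc.parameters():
            p.requires_grad_(False)
        self.residual = residual                 # small correction, trained with RL
        self.action_scale = action_scale         # keep corrections local

    def forward(self, obs):
        with torch.no_grad():
            base = self.chunked_bc(obs)          # coarse action from the BC plan
        return base + self.action_scale * self.residual(obs)
```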
- AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation [65.4532392602682]
One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy.
This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation.
We introduce AlberDICE, an offline MARL algorithm that performs centralized training of individual agents based on stationary distribution optimization.
arXiv Detail & Related papers (2023-11-03T18:56:48Z)
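The alternating structure named in the title above can be sketched as follows; `agent.update` is a purely hypothetical stand-in for the paper's stationary distribution correction (DICE) machinery:

```python
def alternating_training(agents, offline_data, rounds=10):
    """One agent is updated at a time while the others stay fixed, so the
    induced joint actions stay close to the dataset's support. AlberDICE
    performs each such update via stationary distribution correction
    estimation; that machinery is abstracted away here."""
    for _ in range(rounds):
        for i, agent in enumerate(agents):
            partners = [a for j, a in enumerate(agents) if j != i]
            agent.update(offline_data, fixed_partners=partners)  # hypothetical API
```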
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval [80.54532535622988]
We show that a generalizable dense retriever can be trained to achieve high accuracy in both supervised and zero-shot retrieval.
DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations.
arXiv Detail & Related papers (2023-02-15T03:53:26Z)
- ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning [48.033516430071494]
We introduce a modified version of behavioral cloning (BC) that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training.
We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms standard BC by being mode-seeking in nature.
arXiv Detail & Related papers (2022-11-08T04:54:54Z)
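A minimal sketch of the mode-seeking weighting described above, assuming a discriminator trained GAN-style to score how expert-like a state-action pair is; the exact ABC procedure may differ:

```python
import torch

def abc_style_loss(policy, discriminator, states, actions):
    """Discriminator-weighted BC: pairs the discriminator rates as
    expert-like get more weight, so the policy commits to a mode of the
    demonstration distribution instead of averaging across modes."""
    with torch.no_grad():
        w = torch.sigmoid(discriminator(states, actions))  # expert-likeness in (0, 1)
    return -(w * policy.log_prob(states, actions)).mean()
```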
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
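The mechanism that lets IQL avoid querying unseen actions is expectile regression on dataset actions; a minimal sketch of that loss:

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared error: with tau > 0.5, the value function is
    pushed toward an upper expectile of Q over dataset actions, which
    approximates a max without evaluating out-of-dataset actions."""
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())  # tau if diff >= 0 else 1 - tau
    return (weight * diff ** 2).mean()
```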
- Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning [81.15016852963676]
We re-implement state-of-the-art baselines in the offline RL regime under a fair, unified, and highly factorized framework.
We show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end.
arXiv Detail & Related papers (2021-07-03T11:00:56Z)