ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2210.05158v1
- Date: Tue, 11 Oct 2022 05:37:22 GMT
- Title: ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement
Learning
- Authors: Tung Nguyen, Qinqing Zheng, Aditya Grover
- Abstract summary: The goal of offline reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions.
Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning.
We propose ConserWeightive Behavioral Cloning (CWBC) to improve the performance of conditional BC for offline RL.
- Score: 27.322942155582687
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of offline reinforcement learning (RL) is to learn near-optimal
policies from static logged datasets, thus sidestepping expensive online
interactions. Behavioral cloning (BC) provides a straightforward solution to
offline RL by mimicking offline trajectories via supervised learning. Recent
advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have
shown that by conditioning on desired future returns, BC can perform
competitively with its value-based counterparts, while enjoying greater
simplicity and training stability. However, the distribution of returns in the
offline dataset can be arbitrarily skewed and suboptimal, which poses a unique
challenge for conditioning BC on expert returns at test time. We propose
ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for
improving the performance of conditional BC for offline RL with two key
components: trajectory weighting and conservative regularization. Trajectory
weighting addresses the bias-variance tradeoff in conditional BC and provides a
principled mechanism to learn from both low return trajectories (typically
plentiful) and high return trajectories (typically few). Further, we analyze
the notion of conservatism in existing BC methods, and propose a novel
conservative regularizer that explicitly encourages the policy to stay close to
the data distribution. The regularizer helps achieve more reliable performance,
and removes the need for ad-hoc tuning of the conditioning value during
evaluation. We instantiate CWBC in the context of Reinforcement Learning via
Supervised Learning (RvS) (Emmons et al., 2021) and Decision Transformer (DT)
(Chen et al., 2021), and empirically show that it significantly boosts the
performance and stability of prior methods on various offline RL benchmarks.
Code is available at https://github.com/tung-nd/cwbc.
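As a rough illustration of the two components described above, the sketch below (PyTorch) pairs a return-conditioned policy with (i) softmax trajectory weights computed from returns and (ii) a conservative term that conditions on inflated, out-of-distribution returns and penalizes drift from the dataset actions. The weighting scheme, the return-inflation rule, and all hyperparameters are illustrative assumptions rather than the paper's exact formulation; the linked repository contains the actual method.

```python
# Illustrative sketch: return-conditioned BC with return-based trajectory
# weighting and a conservative regularizer. Assumptions, not the paper's code.
import torch
import torch.nn as nn


class ReturnConditionedPolicy(nn.Module):
    """MLP policy conditioned on the state and a target return-to-go."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor, target_return: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, target_return], dim=-1))


def trajectory_weights(returns: torch.Tensor, temperature: float = 10.0) -> torch.Tensor:
    # Upweight scarce high-return trajectories relative to plentiful
    # low-return ones; the temperature trades off bias toward expert data
    # against the variance of relying on few trajectories.
    return torch.softmax((returns - returns.max()) / temperature, dim=0)


def cwbc_style_loss(policy, states, actions, returns_to_go, weights,
                    inflate: float = 0.2, reg_coef: float = 1.0):
    # Weighted conditional-BC term: imitate dataset actions given their
    # observed returns-to-go, with per-sample trajectory weights.
    pred = policy(states, returns_to_go)
    bc_loss = (weights * ((pred - actions) ** 2).mean(dim=-1)).sum()

    # Conservative term: even when conditioned on inflated returns outside
    # the data distribution, the policy should stay close to dataset actions.
    inflated_rtg = returns_to_go * (1.0 + inflate)
    pred_ood = policy(states, inflated_rtg)
    conservative_loss = ((pred_ood - actions) ** 2).mean()

    return bc_loss + reg_coef * conservative_loss


if __name__ == "__main__":
    # Shape check on random data (64 transitions, 17-dim states, 6-dim
    # actions); real use would draw these from an offline dataset.
    policy = ReturnConditionedPolicy(state_dim=17, action_dim=6)
    states, actions = torch.randn(64, 17), torch.randn(64, 6)
    rtg = torch.rand(64, 1) * 100.0
    w = trajectory_weights(rtg.squeeze(-1))
    loss = cwbc_style_loss(policy, states, actions, rtg, w)
    loss.backward()
    print(float(loss))
```

The temperature in trajectory_weights is the knob for the bias-variance tradeoff mentioned in the abstract: lower values concentrate training on the few high-return trajectories, while higher values fall back toward uniform cloning of the whole dataset.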
Related papers
- From Imitation to Refinement -- Residual RL for Precise Assembly [19.9786629249219]
Recent advances in behavior cloning (BC), like action-chunking and diffusion, have led to impressive progress.
Our key insight is that chunked BC policies function as trajectory planners, enabling long-horizon tasks.
We present ResiP (Residual for Precise Manipulation), which sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with RL.
arXiv Detail & Related papers (2024-07-23T17:44:54Z)
- SMORE: Score Models for Offline Goal-Conditioned Reinforcement Learning [33.125187822259186]
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions.
We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe.
arXiv Detail & Related papers (2023-11-03T16:19:33Z)
- Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and Stable Online Fine-Tuning [7.462336024223669]
A key challenge is overcoming overestimation bias for actions not present in the data.
One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC).
We demonstrate that by continuing to train a policy offline while reducing the influence of the BC component, we can produce refined policies.
arXiv Detail & Related papers (2022-11-21T19:10:27Z)
- Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning [80.25648265273155]
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment.
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data.
We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability.
Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
arXiv Detail & Related papers (2022-10-25T09:08:26Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- RORL: Robust Offline Reinforcement Learning via Conservative Smoothing [72.8062448549897]
Offline reinforcement learning can exploit the massive amount of offline data for complex decision-making tasks.
Current offline RL algorithms are generally designed to be conservative for value estimation and action selection.
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
arXiv Detail & Related papers (2022-06-06T18:07:41Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which uses an experience-picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
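For concreteness, the conservatism in the CQL entry above is usually realized as a critic penalty that pushes Q-values down on broadly sampled actions and up on the actions in the dataset; minimizing it is what yields the cited lower bound on the policy's value. The snippet below is a simplified sketch of that penalty for continuous actions, using uniform action sampling only; the full CQL objective also mixes in policy samples and a Bellman-error term.

```python
# Simplified CQL-style conservative penalty (a sketch, not the full objective).
import torch
import torch.nn as nn


def cql_conservative_penalty(q_fn, states, data_actions,
                             action_low=-1.0, action_high=1.0,
                             num_samples=10, alpha=1.0):
    # logsumexp of Q over sampled actions (a soft maximum) minus Q on the
    # dataset actions: minimizing this pushes Q down off-data and up on-data.
    batch, act_dim = data_actions.shape
    rand_actions = (torch.rand(batch, num_samples, act_dim)
                    * (action_high - action_low) + action_low)
    states_rep = states.unsqueeze(1).expand(-1, num_samples, -1)
    q_rand = q_fn(states_rep.reshape(batch * num_samples, -1),
                  rand_actions.reshape(batch * num_samples, act_dim))
    q_rand = q_rand.reshape(batch, num_samples)
    q_data = q_fn(states, data_actions).reshape(batch)
    return alpha * (torch.logsumexp(q_rand, dim=1) - q_data).mean()


if __name__ == "__main__":
    # Toy Q-network on random data, just to show shapes and usage.
    q_net = nn.Sequential(nn.Linear(17 + 6, 64), nn.ReLU(), nn.Linear(64, 1))

    def q_fn(s, a):
        return q_net(torch.cat([s, a], dim=-1))

    penalty = cql_conservative_penalty(q_fn, torch.randn(32, 17),
                                       torch.rand(32, 6) * 2.0 - 1.0)
    print(float(penalty))
```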