Offline Reinforcement Learning via High-Fidelity Generative Behavior
Modeling
- URL: http://arxiv.org/abs/2209.14548v1
- Date: Thu, 29 Sep 2022 04:36:23 GMT
- Title: Offline Reinforcement Learning via High-Fidelity Generative Behavior
Modeling
- Authors: Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su and Jun Zhu
- Abstract summary: We show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training.
We adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model.
Our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods.
- Score: 34.88897402357158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In offline reinforcement learning, weighted regression is a common method to
ensure the learned policy stays close to the behavior policy and to prevent
selecting out-of-sample actions. In this work, we show that due to the limited
distributional expressivity of policy models, previous methods might still
select unseen actions during training, which deviates from their initial
motivation. To address this problem, we adopt a generative approach by
decoupling the learned policy into two parts: an expressive generative behavior
model and an action evaluation model. The key insight is that such decoupling
avoids learning an explicitly parameterized policy model with a closed-form
expression. Directly learning the behavior policy allows us to leverage
existing advances in generative modeling, such as diffusion-based methods, to
model diverse behaviors. As for action evaluation, we combine our method with
an in-sample planning technique to further avoid selecting out-of-sample
actions and increase computational efficiency. Experimental results on D4RL
datasets show that our proposed method achieves competitive or superior
performance compared with state-of-the-art offline RL methods, especially in
complex tasks such as AntMaze. We also empirically demonstrate that our method
can successfully learn from a heterogeneous dataset containing multiple
distinctive but similarly successful strategies, whereas previous unimodal
policies fail.
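To make the decoupling described above concrete, below is a minimal, hypothetical sketch (not the authors' code): an expressive generative behavior model proposes candidate actions, and a separate action-evaluation model (a learned Q-function) re-weights them, so the executed action always lies in the behavior model's support. The diffusion behavior model is replaced by a Gaussian-mixture stand-in, and all names (BehaviorModel, q_value, select_action, temperature) are assumptions for illustration only.

# Minimal sketch of a decoupled policy: generative behavior model + action evaluation.
# This is an illustrative stand-in, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

class BehaviorModel:
    """Placeholder for an expressive generative model of dataset actions
    (the paper uses a diffusion model; a Gaussian mixture stands in here)."""
    def __init__(self, means, scale=0.1):
        self.means = np.asarray(means)   # one mode per distinct behavior in the data
        self.scale = scale

    def sample(self, state, n):
        # Draw n candidate actions; the state conditioning is omitted in this toy stand-in.
        idx = rng.integers(len(self.means), size=n)
        return self.means[idx] + self.scale * rng.standard_normal((n, self.means.shape[1]))

def q_value(state, actions):
    """Placeholder learned critic: prefers actions near (1, 1)."""
    return -np.sum((actions - 1.0) ** 2, axis=-1)

def select_action(state, behavior, n_candidates=32, temperature=1.0):
    """Sample candidates from the behavior model, then resample one with probability
    proportional to exp(Q / temperature). Because every candidate comes from the
    behavior model, no out-of-sample action can be selected."""
    candidates = behavior.sample(state, n_candidates)
    q = q_value(state, candidates)
    weights = np.exp((q - q.max()) / temperature)   # numerically stable softmax weights
    probs = weights / weights.sum()
    return candidates[rng.choice(n_candidates, p=probs)]

# Toy usage: a heterogeneous dataset containing two distinct behavior modes.
behavior = BehaviorModel(means=[[1.0, 1.0], [-1.0, -1.0]])
print(select_action(state=None, behavior=behavior))   # lands near the higher-value mode

In contrast to weighted regression, which typically distills the data into a single unimodal policy, this sample-then-evaluate scheme can keep multiple distinct behavior modes and simply prefer the better-scoring one at decision time.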
Related papers
- Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Behavior Estimation from Multi-Source Data for Offline Reinforcement
Learning [20.143230846339804]
Behavior estimation aims at estimating the policy with which training data are generated.
This work considers a scenario where the data are collected from multiple sources.
With extensive evaluation, this work confirms the existence of behavior misspecification and the efficacy of the proposed model.
arXiv Detail & Related papers (2022-11-29T10:41:44Z) - Random Actions vs Random Policies: Bootstrapping Model-Based Direct
Policy Search [0.0]
This paper studies the impact of the initial data gathering method on the subsequent learning of a dynamics model.
Dynamics models approximate the true transition function of a given task, in order to perform policy search directly on the model.
arXiv Detail & Related papers (2022-10-21T08:26:10Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Know Your Boundaries: The Necessity of Explicit Behavioral Cloning in
Offline RL [28.563015766188478]
We introduce an offline reinforcement learning algorithm that explicitly clones a behavior policy to constrain value learning.
We show state-of-the-art performance on several datasets within the D4RL and Robomimic benchmarks.
arXiv Detail & Related papers (2022-06-01T18:04:43Z) - Training and Evaluation of Deep Policies using Reinforcement Learning
and Generative Models [67.78935378952146]
GenRL is a framework for solving sequential decision-making problems.
It exploits the combination of reinforcement learning and latent variable generative models.
We experimentally determine which characteristics of the generative model most influence the performance of the final trained policy.
arXiv Detail & Related papers (2022-04-18T22:02:32Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from
Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)