Deployment-Efficient Reinforcement Learning via Model-Based Offline
Optimization
- URL: http://arxiv.org/abs/2006.03647v2
- Date: Tue, 23 Jun 2020 16:54:09 GMT
- Title: Deployment-Efficient Reinforcement Learning via Model-Based Offline
Optimization
- Authors: Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum,
Shixiang Gu
- Abstract summary: We propose deployment efficiency, a novel measure counting the number of distinct data-collection policies used during policy learning. We also propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that can effectively optimize a policy offline using 10-20 times less data than prior works.
- Score: 46.017212565714175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most reinforcement learning (RL) algorithms assume online access to the
environment, in which one may readily interleave updates to the policy with
experience collection using that policy. However, in many real-world
applications such as health, education, dialogue agents, and robotics, the cost
or potential risk of deploying a new data-collection policy is high, to the
point that it can become prohibitive to update the data-collection policy more
than a few times during learning. With this view, we propose a novel concept of
deployment efficiency, measuring the number of distinct data-collection
policies that are used during policy learning. We observe that naïvely
applying existing model-free offline RL algorithms recursively does not lead to
a practical deployment-efficient and sample-efficient algorithm. We propose a
novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that
can effectively optimize a policy offline using 10-20 times fewer data than
prior works. Furthermore, the recursive application of BREMEN is able to
achieve impressive deployment efficiency while maintaining the same or better
sample efficiency, learning successful policies from scratch on simulated
robotic environments with only 5-10 deployments, compared to typical values of
hundreds to millions in standard RL baselines. Codes and pre-trained models are
available at https://github.com/matsuolab/BREMEN .
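The deployment-efficient loop described in the abstract (deploy a policy, fit an ensemble of dynamics models to the batch, then optimize offline under a behavior-regularized objective) can be sketched on a toy 1-D control task. Everything here is an illustrative assumption rather than the paper's implementation: the linear dynamics models, the gain-grid search standing in for BREMEN's KL-regularized trust-region updates, and all constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D task: s' = s + a + noise, reward -s'^2 (drive the state to 0).
def env_step(s, a):
    s_next = s + a + 0.01 * rng.standard_normal()
    return s_next, -s_next ** 2

def collect(gain, n=200):
    """One deployment: run the linear-Gaussian policy a = gain*s + noise."""
    data, s = [], rng.standard_normal()
    for _ in range(n):
        a = gain * s + 0.1 * rng.standard_normal()
        s_next, _ = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next if abs(s_next) < 5 else rng.standard_normal()
    return np.array(data)

def fit_ensemble(data, k=5):
    """Bootstrap an ensemble of linear dynamics models s' ~ w0*s + w1*a."""
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(data), len(data))
        X, y = data[idx][:, :2], data[idx][:, 2]
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        models.append(w)
    return models

def behavior_clone(data):
    """Estimate the behavior policy's gain b from the batch (a ~ b*s)."""
    s, a = data[:, 0], data[:, 1]
    return float(s @ a / (s @ s))

def offline_optimize(models, b, radius=0.5):
    """Grid-search a new gain over short imagined rollouts, constrained to
    stay within `radius` of the behavior-cloned gain -- a crude stand-in
    for BREMEN's KL-regularized trust-region policy update."""
    best_gain, best_ret = b, -np.inf
    for gain in np.linspace(-2.0, 0.0, 81):
        if abs(gain - b) > radius:
            continue
        ret = 0.0
        for w in models:
            s = 1.0
            for _ in range(20):  # short imagined rollout in each model
                s = w[0] * s + w[1] * (gain * s)
                ret -= s ** 2
        if ret > best_ret:
            best_ret, best_gain = ret, gain
    return best_gain

# Only a handful of deployments, as in the paper's setting.
gain = 0.0
for deployment in range(5):
    batch = collect(gain)
    gain = offline_optimize(fit_ensemble(batch), behavior_clone(batch))

print(f"final policy gain: {gain:.2f}")
```

The trust-region constraint keeps each new policy close to the one that actually generated the data, which is what lets the optimization proceed offline without the model being exploited far from the batch; the policy gain moves toward the optimum (a = -s) a bounded step per deployment.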
Related papers
- MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot
Learning [52.101643259906915]
We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations.
Existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains.
We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization.
arXiv Detail & Related papers (2024-01-06T21:04:31Z)
- A Unified Framework for Alternating Offline Model Training and Policy Learning [62.19209005400561]
In offline model-based reinforcement learning, we learn a dynamics model from historically collected data, and utilize the learned model and the fixed dataset for policy learning.
We develop an iterative offline MBRL framework, where we maximize a lower bound of the true expected return.
With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets.
arXiv Detail & Related papers (2022-10-12T04:58:51Z)
- Fully Decentralized Model-based Policy Optimization for Networked Systems [23.46407780093797]
This work aims to improve data efficiency of multi-agent control by model-based learning.
We consider networked systems where agents are cooperative and communicate only locally with their neighbors.
In our method, each agent learns a dynamics model to predict future states and broadcasts its predictions to its neighbors; the policies are then trained on model rollouts.
arXiv Detail & Related papers (2022-07-13T23:52:14Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
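The two-policy idea behind JSRL can be sketched in a few lines on a hypothetical chain task: a pre-existing guide policy controls the first h steps of each episode ("jump-starting" it), the learning policy controls the rest, and a curriculum shrinks h as the combined policy keeps succeeding. The environment, both policies, and the success threshold are illustrative assumptions, not JSRL's actual training setup (which also updates the exploration policy as it learns).

```python
import random

random.seed(0)

GOAL, HORIZON = 5, 8

def env_step(s, a):
    s = s + a
    return s, (1.0 if s == GOAL else 0.0)

def guide(s):     # pre-existing policy: always steps toward the goal
    return 1

def explorer(s):  # untrained learner: random steps (never updated here)
    return random.choice([-1, 1])

def jsrl_rollout(h):
    """Guide policy acts for the first h steps, the learner for the rest."""
    s, ret = 0, 0.0
    for t in range(HORIZON):
        a = guide(s) if t < h else explorer(s)
        s, r = env_step(s, a)
        ret += r
    return ret

# Curriculum: shrink the jump-start horizon h whenever the combined
# policy's success rate stays high (a stand-in for JSRL's schedule).
h = HORIZON
for _ in range(20):
    success_rate = sum(jsrl_rollout(h) > 0 for _ in range(20)) / 20
    if success_rate >= 0.9:
        h -= 1  # hand more of the episode back to the learner

print(f"guide horizon after curriculum: {h}")
```

Because the guide drops the learner off near states from which success is easy, the learner only ever has to master the tail of the task at each curriculum stage; here h shrinks until the untrained explorer can no longer maintain the success threshold.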
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons [16.635744815056906]
We consider reinforcement learning methods in offline domains without additional online data collection, such as mobile health applications.
The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithms as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived based on the initial Q-estimator.
arXiv Detail & Related papers (2022-02-26T15:29:46Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from the value-function view.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action space.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
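The uncertainty-penalized reward idea can be shown in a few lines: query an ensemble of learned dynamics models and subtract a penalty proportional to their disagreement, so the policy is pushed back toward regions the data supports. The two toy "models" and the disagreement-as-uncertainty heuristic below are illustrative assumptions, not MOPO's actual uncertainty estimator.

```python
import numpy as np

# Two hypothetical learned dynamics models that agree near the data
# (small s) and diverge far from it.
models = [lambda s, a: s + a,
          lambda s, a: s + a + 0.5 * s ** 2]

def reward_fn(s, a):
    return -abs(s)

def penalized_reward(s, a, lam=1.0):
    """MOPO-style reward: the task reward minus lambda times an
    uncertainty estimate (here, ensemble disagreement)."""
    preds = np.array([m(s, a) for m in models])
    return reward_fn(s, a) - lam * preds.std()

in_dist = penalized_reward(0.1, 0.0)   # models nearly agree: tiny penalty
out_dist = penalized_reward(3.0, 0.0)  # models disagree: large penalty
print(in_dist, out_dist)
```

A policy maximizing the penalized reward inside the model is thus discouraged from visiting states where the dynamics are poorly known, which is the mechanism behind MOPO's offline performance guarantees.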
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.