Efficient Offline Policy Optimization with a Learned Model
- URL: http://arxiv.org/abs/2210.05980v1
- Date: Wed, 12 Oct 2022 07:41:04 GMT
- Title: Efficient Offline Policy Optimization with a Learned Model
- Authors: Zichen Liu, Siyi Li, Wee Sun Lee, Shuicheng Yan, Zhongwen Xu
- Abstract summary: MuZero Unplugged presents a promising approach for offline policy learning from logged data.
It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data.
This paper investigates several hypotheses about why MuZero Unplugged may not work well in offline settings.
- Score: 83.64779942889916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: MuZero Unplugged presents a promising approach for offline policy learning
from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned
model and leverages the Reanalyze algorithm to learn purely from offline data. For
good performance, MCTS requires accurate learned models and a large number of
simulations, which makes it computationally expensive. This paper investigates
several hypotheses about why MuZero Unplugged may not work well in offline RL
settings, including 1) learning with limited data coverage; 2) learning from
offline data of stochastic environments; 3) improperly parameterized models
given the offline data; and 4) learning under a low compute budget. We propose a
regularized one-step look-ahead approach to tackle the above issues. Instead of
planning with the expensive MCTS, we use the learned model to construct an
advantage estimation based on a one-step rollout. The policy is improved in
the direction that maximizes the estimated advantage, with regularization
toward the dataset. We conduct extensive empirical studies with
BSuite environments to verify the hypotheses and then run our algorithm on the
RL Unplugged Atari benchmark. Experimental results show that our proposed
approach achieves stable performance even with an inaccurate learned model. On
the large-scale Atari benchmark, the proposed method outperforms MuZero
Unplugged by 43%. Most significantly, it needs only 5.6% of the wall-clock time
of MuZero Unplugged (1 hour vs. 17.8 hours) to achieve a 150% IQM normalized
score with the same hardware and software stacks.
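As a concrete illustration of the idea, here is a minimal sketch (in Python/PyTorch) of a regularized one-step look-ahead update. It assumes a discrete action space, a MuZero-style learned model with hypothetical `prediction(state) -> (policy_logits, value)` and `dynamics(state, action) -> (reward, next_state)` heads, and illustrative temperature and regularization coefficients; it is not the authors' exact implementation.

```python
# Sketch only: the model interface, temperature `tau`, and behavior-regularization
# weight `beta` are illustrative assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def one_step_advantages(model, state, num_actions, gamma=0.997):
    """Estimate A(s, a) = r(s, a) + gamma * V(s') - V(s) for every action
    using a single learned-model rollout per action (no tree search)."""
    _, value = model.prediction(state)                       # V(s), shape [B]
    advs = []
    for a in range(num_actions):
        action = torch.full((state.shape[0],), a, dtype=torch.long)
        reward, next_state = model.dynamics(state, action)   # one-step rollout
        _, next_value = model.prediction(next_state)         # V(s')
        advs.append(reward + gamma * next_value - value)
    return torch.stack(advs, dim=-1)                         # [B, num_actions]

def policy_loss(model, policy_logits, state, dataset_action,
                num_actions, tau=1.0, beta=0.2):
    """Move the policy toward an advantage-weighted target while staying
    close to the actions observed in the offline dataset."""
    with torch.no_grad():
        adv = one_step_advantages(model, state, num_actions)
        prior = F.softmax(policy_logits, dim=-1)
        target = prior * torch.exp(adv / tau)                # exp(A / tau)-weighted prior
        target = target / target.sum(dim=-1, keepdim=True)   # renormalize
    log_pi = F.log_softmax(policy_logits, dim=-1)
    improvement = -(target * log_pi).sum(dim=-1)             # cross-entropy to target
    behavior_reg = F.cross_entropy(policy_logits, dataset_action, reduction="none")
    return (improvement + beta * behavior_reg).mean()
```

Compared with MCTS, which runs many simulations per state, such an update needs only one model rollout per candidate action, which is consistent with the wall-clock savings reported above.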
Related papers
- Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [60.05963742334746]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers.
Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy.
We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Deep Model Predictive Optimization [21.22047409735362]
A major challenge in robotics is to design robust policies which enable complex and agile behaviors in the real world.
We propose Deep Model Predictive Optimization (DMPO), which learns the inner-loop of an MPC optimization algorithm directly via experience.
DMPO can outperform the best MPC algorithm by up to 27% with fewer samples, and an end-to-end policy trained with MFRL by 19%.
arXiv Detail & Related papers (2023-10-06T21:11:52Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Model-based Offline Imitation Learning with Non-expert Data [7.615595533111191]
We propose a scalable model-based offline imitation learning algorithmic framework that leverages datasets collected by both suboptimal and optimal policies.
We show that the proposed method always outperforms Behavioral Cloning in the low-data regime on simulated continuous control domains.
arXiv Detail & Related papers (2022-06-11T13:08:08Z)
- FINETUNA: Fine-tuning Accelerated Molecular Simulations [5.543169726358164]
We present an online active learning framework for accelerating the simulation of atomic systems efficiently and accurately.
Experiments on 30 benchmark adsorbate-catalyst systems show that our transfer learning method, which incorporates prior information from pre-trained models, accelerates simulations by reducing the number of DFT calculations by 91%.
arXiv Detail & Related papers (2022-05-02T21:36:01Z)
- Reliable validation of Reinforcement Learning Benchmarks [1.2031796234206134]
Reinforcement Learning (RL) is one of the most dynamic research areas in Game AI and AI as a whole.
There are numerous benchmark environments whose scores are used to compare different algorithms, such as Atari.
We propose improving this situation by providing access to the original experimental data to validate study results.
arXiv Detail & Related papers (2022-03-02T12:55:27Z)
- An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
- Online and Offline Reinforcement Learning by Planning with a Learned Model [15.8026041700727]
We describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points.
We show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions.
We introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL.
arXiv Detail & Related papers (2021-04-13T15:36:06Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to 50×), with significantly reduced training data generation (up to 30×) and better accuracy (+8.7%) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)