Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL
- URL: http://arxiv.org/abs/2206.02380v2
- Date: Tue, 7 Jun 2022 14:01:50 GMT
- Title: Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL
- Authors: Abhinav Bhatia, Philip S. Thomas, Shlomo Zilberstein
- Abstract summary: We frame the problem of tuning the rollout length as a meta-level sequential decision-making problem.
We use model-free deep reinforcement learning to solve the meta-level decision problem.
- Score: 39.58890668062184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based reinforcement learning promises to learn an optimal policy from
fewer interactions with the environment compared to model-free reinforcement
learning by learning an intermediate model of the environment in order to
predict future interactions. When predicting a sequence of interactions, the
rollout length, which limits the prediction horizon, is a critical
hyperparameter as accuracy of the predictions diminishes in the regions that
are further away from real experience. As a result, with a longer rollout
length, an overall worse policy is learned in the long run. Thus, the
hyperparameter provides a trade-off between quality and efficiency. In this
work, we frame the problem of tuning the rollout length as a meta-level
sequential decision-making problem that optimizes the final policy learned by
model-based reinforcement learning given a fixed budget of environment
interactions by adapting the hyperparameter dynamically based on feedback from
the learning process, such as accuracy of the model and the remaining budget of
interactions. We use model-free deep reinforcement learning to solve the
meta-level decision problem and demonstrate that our approach outperforms
common heuristic baselines on two well-known reinforcement learning
environments.
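The abstract describes the approach only at a high level: a meta-level agent observes feedback from the base model-based learner, such as model accuracy and the remaining interaction budget, and chooses the rollout length for the next phase of training, with the meta-level problem solved by model-free deep RL. As a reading aid, here is a minimal Python sketch of that framing. It is not the paper's code: the base MBRL learner is replaced by a synthetic stand-in so the example runs end to end, a simple tabular Q-learner stands in for the deep meta-learner, and all names (ROLLOUT_LENGTHS, base_learner_step, meta_state) and the toy dynamics are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): rollout-length tuning as a meta-level MDP.
# The meta-agent observes coarse features of the base MBRL run (model error, remaining
# interaction budget) and picks the rollout length for the next learning phase.
import random
from collections import defaultdict

ROLLOUT_LENGTHS = [1, 2, 4, 8]        # meta-level actions (assumed set of lengths)
TOTAL_BUDGET = 10_000                 # real environment interactions per run
STEPS_PER_META_DECISION = 500         # interactions spent between meta-decisions

def base_learner_step(model_error, rollout_len):
    """Synthetic stand-in for one iteration of the base MBRL learner.

    Returns (new_model_error, policy_improvement). Longer rollouts help while the
    model is accurate and hurt once it is not -- the trade-off the paper targets.
    """
    model_error = max(0.02, model_error * random.uniform(0.9, 0.99))
    useful_horizon = 1.0 / model_error
    improvement = (min(rollout_len, useful_horizon)
                   - 0.3 * max(0.0, rollout_len - useful_horizon))
    return model_error, improvement

def meta_state(model_error, remaining_budget):
    # Discretize the meta-observation so a tabular learner can be used here.
    return (round(model_error, 1), remaining_budget // 2000)

def run(meta_q, epsilon=0.1, alpha=0.1):
    """One meta-episode: a full budget-limited MBRL run controlled by the meta-agent."""
    model_error, budget, ret = 1.0, TOTAL_BUDGET, 0.0
    while budget > 0:
        s = meta_state(model_error, budget)
        if random.random() < epsilon:
            a = random.randrange(len(ROLLOUT_LENGTHS))
        else:
            a = max(range(len(ROLLOUT_LENGTHS)), key=lambda i: meta_q[s][i])
        model_error, reward = base_learner_step(model_error, ROLLOUT_LENGTHS[a])
        budget -= STEPS_PER_META_DECISION
        s_next = meta_state(model_error, budget)
        target = reward + (max(meta_q[s_next]) if budget > 0 else 0.0)
        meta_q[s][a] += alpha * (target - meta_q[s][a])  # model-free (Q-learning) update
        ret += reward
    return ret

meta_q = defaultdict(lambda: [0.0] * len(ROLLOUT_LENGTHS))
for episode in range(200):
    run(meta_q)
print("trained meta-policy; visited meta-states:", len(meta_q))
```

The point of the sketch is the interface rather than the numbers: the meta-action is the rollout length, the meta-state summarizes learning progress, and the meta-reward is the improvement of the base policy achieved within the fixed interaction budget.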
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Deep autoregressive density nets vs neural ensembles for model-based offline reinforcement learning [2.9158689853305693]
We consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts.
This approach is vulnerable to the policy exploiting model errors, which can lead to catastrophic failures on the real system.
We show that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark.
arXiv Detail & Related papers (2024-02-05T10:18:15Z)
- Model predictive control-based value estimation for efficient reinforcement learning [6.8237783245324035]
We design an improved reinforcement learning method based on model predictive control that models the environment through a data-driven approach.
Based on the learned environment model, it performs multi-step prediction to estimate the value function and optimize the policy (a generic sketch of this kind of multi-step model-based value estimate appears after this list).
The method demonstrates higher learning efficiency, faster policy convergence toward a locally optimal value, and a smaller required experience replay buffer.
arXiv Detail & Related papers (2023-10-25T13:55:14Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Efficient Dynamics Modeling in Interactive Environments with Koopman Theory [22.7309724944471]
We show how to efficiently parallelize the sequential problem of long-range prediction using convolution.
We also show that this model can be easily incorporated into dynamics modeling for model-based planning and model-free RL.
arXiv Detail & Related papers (2023-06-20T23:38:24Z)
- Revisiting Design Choices in Model-Based Offline Reinforcement Learning [39.01805509055988]
Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies.
This paper compares common design choices and designs novel protocols to investigate their interaction with other hyperparameters, such as the number of models or the imaginary rollout horizon.
arXiv Detail & Related papers (2021-10-08T13:51:34Z)
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Discriminator Augmented Model-Based Reinforcement Learning [47.094522301093775]
It is common in practice for the learned model to be inaccurate, impairing planning and leading to poor performance.
This paper aims to improve planning with an importance sampling framework that accounts for discrepancy between the true and learned dynamics.
arXiv Detail & Related papers (2021-03-24T06:01:55Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
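Several of the listed papers, including the model predictive control-based value estimation work referenced above, share one generic mechanism: roll the current policy through a learned dynamics model for a limited horizon and bootstrap the remainder with a value function; that horizon is exactly the rollout length the main paper adapts. Below is a minimal, generic Python sketch of such an H-step model-based value estimate, offered under stated assumptions; dynamics_model, reward_model, policy, and value_fn are placeholder callables, not code from any of the papers.

```python
# Minimal sketch (not from any paper above): an H-step value estimate computed by
# rolling a policy through a learned dynamics model and bootstrapping with a value
# function beyond the rollout horizon. All callables are illustrative stand-ins.
import numpy as np

def model_value_estimate(state, policy, dynamics_model, reward_model, value_fn,
                         horizon=5, gamma=0.99):
    """Estimate V(state) from an imagined H-step rollout plus a bootstrapped tail."""
    total, discount, s = 0.0, 1.0, state
    for _ in range(horizon):
        a = policy(s)                      # action from the current policy
        total += discount * reward_model(s, a)
        s = dynamics_model(s, a)           # imagined next state (no real interaction)
        discount *= gamma
    return total + discount * value_fn(s)  # bootstrap beyond the rollout horizon

# Toy usage with linear stand-ins for the learned components.
dim = 4
A = 0.9 * np.eye(dim)
estimate = model_value_estimate(
    state=np.ones(dim),
    policy=lambda s: -0.1 * s,
    dynamics_model=lambda s, a: A @ s + a,
    reward_model=lambda s, a: -float(s @ s),
    value_fn=lambda s: -float(s @ s) / (1 - 0.99),
)
print("imagined H-step value estimate:", estimate)
```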
This list is automatically generated from the titles and abstracts of the papers on this site.