MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2401.11380v1
- Date: Sun, 21 Jan 2024 03:11:50 GMT
- Title: MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning
- Authors: Mao Hong, Zhiyue Zhang, Yue Wu, Yanxun Xu
- Abstract summary: We develop MoMA, a model-based mirror ascent algorithm with general function approximations under partial coverage of offline data.
MoMA distinguishes itself from existing literature by employing an unrestricted policy class.
The effectiveness of MoMA is demonstrated via numerical studies.
- Score: 5.399953810215838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based offline reinforcement learning (RL) methods have achieved
state-of-the-art performance in many decision-making problems thanks to their
sample efficiency and generalizability. Despite these advancements, existing
model-based offline RL approaches either focus on theoretical studies without
developing practical algorithms or rely on a restricted parametric policy
space, thus not fully leveraging the advantages of an unrestricted policy space
inherent to model-based methods. To address this limitation, we develop MoMA, a
model-based mirror ascent algorithm with general function approximations under
partial coverage of offline data. MoMA distinguishes itself from existing
literature by employing an unrestricted policy class. In each iteration, MoMA
conservatively estimates the value function by a minimization procedure within
a confidence set of transition models in the policy evaluation step, then
updates the policy with general function approximations instead of
commonly-used parametric policy classes in the policy improvement step. Under
some mild assumptions, we establish theoretical guarantees of MoMA by proving
an upper bound on the suboptimality of the returned policy. We also provide a
practically implementable, approximate version of the algorithm. The
effectiveness of MoMA is demonstrated via numerical studies.
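As a reading aid, the per-iteration structure described in the abstract (a conservative policy evaluation step that minimizes the value estimate over a confidence set of transition models, followed by a policy improvement step via mirror ascent) can be illustrated with a small tabular sketch. This is not the paper's algorithm: it assumes a finite MDP, represents the confidence set as an explicit list of candidate models, and uses a KL mirror map, which turns mirror ascent into a multiplicative-weights policy update; names such as `candidate_models`, `eta`, and `n_iters` are illustrative.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99):
    """Exact Q^pi for a tabular model P[s, a, s'] and reward R[s, a]."""
    n_s, _ = R.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)      # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", R, pi)        # expected reward under pi
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    return R + gamma * P @ V                   # Q[s, a] = r + gamma * E_{s'}[V(s')]

def pessimistic_q(candidate_models, R, pi, init_dist, gamma=0.99):
    """Policy evaluation step: a conservative estimate that minimizes the
    value of pi over the confidence set of transition models."""
    best_Q, best_val = None, np.inf
    for P in candidate_models:
        Q = policy_evaluation(P, R, pi, gamma)
        val = init_dist @ np.einsum("sa,sa->s", Q, pi)   # J(pi) under model P
        if val < best_val:
            best_Q, best_val = Q, val
    return best_Q

def mirror_ascent(candidate_models, R, init_dist, eta=1.0, n_iters=50, gamma=0.99):
    """Policy improvement step: entropic mirror ascent on the pessimistic Q."""
    n_s, n_a = R.shape
    pi = np.full((n_s, n_a), 1.0 / n_a)        # start from the uniform policy
    for _ in range(n_iters):
        Q = pessimistic_q(candidate_models, R, pi, init_dist, gamma)
        logits = np.log(pi) + eta * Q          # KL mirror map => multiplicative update
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```

With the KL mirror map, the improvement step reduces to pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta * Q(s, a)), which is why the update above is a softmax re-weighting of the current policy rather than a parametric gradient step.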
Related papers
- Operator World Models for Reinforcement Learning [37.69110422996011]
Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making.
It is not directly applicable to Reinforcement Learning (RL) due to the inaccessibility of explicit action-value functions.
We introduce a novel approach based on learning a world model of the environment using conditional mean embeddings.
arXiv Detail & Related papers (2024-06-28T12:05:47Z)
- When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
- Model Generation with Provable Coverability for Offline Reinforcement Learning [14.333861814143718]
Offline optimization with a dynamics-aware policy provides a new perspective on policy learning and out-of-distribution generalization.
However, due to the limitations of the offline setting, the learned model cannot mimic the real dynamics well enough to support reliable out-of-distribution exploration.
We propose an algorithm that generates models optimized for their coverage of the real dynamics.
arXiv Detail & Related papers (2022-06-01T08:34:09Z)
- Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage [33.766012922307084]
We study model-based offline Reinforcement Learning with general function approximation.
We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint to encode pessimism.
arXiv Detail & Related papers (2021-07-13T16:30:01Z)
- Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior heuristics inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by training them on rewards artificially penalized by the uncertainty of the dynamics (see the sketch after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
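The uncertainty-penalized reward idea summarized in the MOPO entry above can be sketched as follows. The details here are assumptions for illustration, not that paper's exact formulation: the uncertainty u(s, a) is estimated as disagreement across an ensemble of learned dynamics models, and `penalty_coef` is a hypothetical hyperparameter.

```python
import numpy as np

def penalized_reward(reward, next_state_preds, penalty_coef=1.0):
    """Return r_hat(s, a) - penalty_coef * u(s, a), where u(s, a) is a crude
    uncertainty proxy: the disagreement of an ensemble of learned dynamics
    models about the next state for the same (s, a) pair.

    reward:           model-predicted scalar reward r_hat(s, a)
    next_state_preds: array of shape (ensemble_size, state_dim) holding each
                      ensemble member's predicted next state
    penalty_coef:     illustrative penalty weight (an assumption, not a value
                      taken from the MOPO paper)
    """
    uncertainty = float(np.linalg.norm(next_state_preds.std(axis=0)))
    return reward - penalty_coef * uncertainty
```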