ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies
with Offline Data
- URL: http://arxiv.org/abs/2211.04538v1
- Date: Tue, 8 Nov 2022 20:15:28 GMT
- Title: ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies
with Offline Data
- Authors: Tengyang Xie, Mohak Bhardwaj, Nan Jiang, Ching-An Cheng
- Abstract summary: We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR).
ARMOR robustly learns policies to improve upon an arbitrary baseline policy regardless of data coverage.
- Score: 27.007647483635516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new model-based offline RL framework, called Adversarial Models
for Offline Reinforcement Learning (ARMOR), which can robustly learn policies
to improve upon an arbitrary baseline policy regardless of data coverage. Based
on the concept of relative pessimism, ARMOR is designed to optimize for the
worst-case relative performance when facing uncertainty. In theory, we prove
that the learned policy of ARMOR never degrades the performance of the baseline
policy with any admissible hyperparameter, and can learn to compete with the
best policy within data coverage when the hyperparameter is well tuned and the
baseline policy is supported by the data. Such a robust policy improvement
property makes ARMOR especially suitable for building real-world learning
systems, because in practice ensuring no performance degradation is imperative
before considering any benefit learning can bring.
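The relative-pessimism idea above can be summarized as a max-min objective. The sketch below is illustrative only, assuming a version space M of models consistent with the offline data, a baseline policy mu, and J_M(pi) for the return of policy pi under model M (notation not taken verbatim from the paper).

```latex
% Minimal sketch of the relative-pessimism objective; \mathcal{M}, \mu, and
% J_M are assumed notation for illustration.
\[
  \hat{\pi} \;\in\; \arg\max_{\pi \in \Pi} \; \min_{M \in \mathcal{M}}
  \big( J_M(\pi) - J_M(\mu) \big)
\]
% Choosing \pi = \mu makes the inner objective exactly zero for every M, so
% the maximizer can never be worse than the baseline under the worst-case
% model, which is the no-degradation property claimed in the abstract.
```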
Related papers
- Offline Hierarchical Reinforcement Learning via Inverse Optimization [23.664330010602708]
OHIO is a framework for offline reinforcement learning of hierarchical policies.
We show it substantially outperforms end-to-end RL methods and improves robustness.
arXiv Detail & Related papers (2024-10-10T14:00:21Z)
- Adversarial Model for Offline Reinforcement Learning [39.77825908168264]
We propose a model-based offline reinforcement learning framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR).
ARMOR can robustly learn policies to improve upon an arbitrary reference policy regardless of data coverage.
We show that ARMOR achieves performance competitive with state-of-the-art offline model-free and model-based RL algorithms.
arXiv Detail & Related papers (2023-02-21T23:08:09Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- A Unified Framework for Alternating Offline Model Training and Policy Learning [62.19209005400561]
In offline model-based reinforcement learning, we learn a dynamics model from historically collected data and utilize the learned model and the fixed dataset for policy learning.
We develop an iterative offline MBRL framework, where we maximize a lower bound of the true expected return.
With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets.
arXiv Detail & Related papers (2022-10-12T04:58:51Z)
- Robust Reinforcement Learning using Offline Data [23.260211453437055]
We propose a robust reinforcement learning algorithm called Robust Fitted Q-Iteration (RFQI).
RFQI uses only an offline dataset to learn the optimal robust policy.
We prove that RFQI learns a near-optimal robust policy under standard assumptions.
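As a hedged illustration of what "optimal robust policy" typically means in this setting, robust fitted Q-iteration methods target a worst-case Bellman backup over an uncertainty set of transition models; the set P(s, a) and the discount factor gamma below are assumed notation, not quoted from the paper.

```latex
% Illustrative robust Bellman backup; the uncertainty set \mathcal{P}(s,a)
% around the nominal transition kernel is an assumption of this sketch.
\[
  (\mathcal{T}_{\mathrm{rob}} Q)(s, a) \;=\; r(s, a) \;+\; \gamma
  \inf_{P \in \mathcal{P}(s, a)} \mathbb{E}_{s' \sim P}
  \Big[ \max_{a'} Q(s', a') \Big]
\]
```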
arXiv Detail & Related papers (2022-08-10T03:47:45Z)
- RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning [11.183124892686239]
We present Robust Adversarial Model-Based Offline RL (RAMBO), a novel approach to model-based offline RL.
To achieve conservatism, we formulate the problem as a two-player zero-sum game against an adversarial environment model.
We evaluate our approach on widely studied offline RL benchmarks and demonstrate that it achieves state-of-the-art performance.
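One way to read the two-player zero-sum formulation is the sketch below; the data-fitting constraint on the adversarial model, its tolerance epsilon, and the model-based return are assumptions made here for illustration.

```latex
% Sketch of an adversarial model-based objective of the kind described above;
% the constraint \mathcal{L}_{\mathrm{data}}(\hat{T}) \le \epsilon is illustrative.
\[
  \max_{\pi} \; \min_{\hat{T} \,:\, \mathcal{L}_{\mathrm{data}}(\hat{T}) \le \epsilon}
  \; J_{\hat{T}}(\pi)
\]
% The adversary picks the least favorable dynamics that still explain the
% offline data, which is what yields the conservatism the summary refers to.
```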
arXiv Detail & Related papers (2022-04-26T20:42:14Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset without further interaction with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z) - Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience-picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
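The "stationary distribution corrections" mentioned above are occupancy ratios; the sketch below uses d^pi for the stationary state-action distribution of policy pi and d^D for that of the dataset, notation assumed here rather than quoted from the paper.

```latex
% Illustrative definition of the correction the summary refers to.
\[
  w^{*}(s, a) \;=\; \frac{d^{\pi^{*}}(s, a)}{d^{D}(s, a)}
\]
% Expectations under the optimal policy can then be rewritten as reweighted
% expectations over dataset samples, which is one way such methods avoid
% bootstrapping through out-of-distribution actions.
```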
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by artificially penalizing rewards by the uncertainty of the dynamics.
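A minimal Python sketch of the uncertainty-penalized reward idea summarized above; the ensemble-disagreement proxy for dynamics uncertainty, the penalty weight lam, and the function names are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of an uncertainty-penalized reward for model-based offline RL.
# The disagreement of a learned dynamics ensemble stands in for the
# uncertainty term u(s, a); all names here are illustrative.
import numpy as np

def penalized_reward(reward_pred: float, next_state_preds: np.ndarray, lam: float = 1.0) -> float:
    """Return the model-predicted reward minus an uncertainty penalty.

    next_state_preds: array of shape (ensemble_size, state_dim) holding each
    ensemble member's next-state prediction for the same (s, a) pair.
    """
    uncertainty = float(np.max(np.std(next_state_preds, axis=0)))
    return reward_pred - lam * uncertainty

# Example: three ensemble members that mildly disagree about the next state.
preds = np.array([[0.10, -0.20], [0.12, -0.18], [0.08, -0.25]])
print(penalized_reward(1.0, preds, lam=0.5))  # reward shrinks as disagreement grows
```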
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.