CROP: Conservative Reward for Model-based Offline Policy Optimization
- URL: http://arxiv.org/abs/2310.17245v1
- Date: Thu, 26 Oct 2023 08:45:23 GMT
- Title: CROP: Conservative Reward for Model-based Offline Policy Optimization
- Authors: Hao Li, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Zhen-Qiu Feng,
Xiao-Yin Liu, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Bo-Xian Yao,
Zeng-Guang Hou
- Abstract summary: This paper proposes a novel model-based offline RL algorithm, Conservative Reward for model-based Offline Policy optimization (CROP)
To achieve a conservative reward estimation, CROP simultaneously minimizes the estimation error and the reward of random actions.
Notably, CROP establishes an innovative connection between offline and online RL, highlighting that offline RL problems can be tackled by adopting online RL techniques.
- Score: 15.121328040092264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) aims to optimize a policy using
collected data without online interactions. Model-based approaches are particularly
appealing for addressing offline RL challenges due to their capability to
mitigate the limitations of offline data through data generation using models.
Prior research has demonstrated that introducing conservatism into the model or
Q-function during policy optimization can effectively alleviate the prevalent
distribution drift problem in offline RL. However, the investigation into the
impacts of conservatism in reward estimation is still lacking. This paper
proposes a novel model-based offline RL algorithm, Conservative Reward for
model-based Offline Policy optimization (CROP), which conservatively estimates
the reward in model training. To achieve a conservative reward estimation, CROP
simultaneously minimizes the estimation error and the reward of random actions.
Theoretical analysis shows that this conservative reward mechanism leads to a
conservative policy evaluation and helps mitigate distribution drift.
Experiments on D4RL benchmarks showcase that the performance of CROP is
comparable to the state-of-the-art baselines. Notably, CROP establishes an
innovative connection between offline and online RL, highlighting that offline
RL problems can be tackled by applying online RL techniques to the empirical
Markov decision process trained with a conservative reward. The source code is
available at https://github.com/G0K0URURI/CROP.git.
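As a rough illustration of the conservative reward objective described in the abstract (fit the reward on the offline data while pushing down the predicted reward of random actions), here is a minimal PyTorch-style sketch. The reward-model interface, the trade-off weight `beta`, and the uniform sampling of random actions are illustrative assumptions, not details taken from the paper.

```python
import torch

def conservative_reward_loss(reward_model, states, actions, rewards,
                             action_dim, beta=0.5):
    """Sketch of a conservative reward objective: minimize the reward
    estimation error on offline data while also minimizing the predicted
    reward of random actions. `beta` and the uniform action sampling are
    assumptions made for illustration."""
    # Supervised term: reward estimation error on the offline batch.
    pred = reward_model(states, actions)
    estimation_error = ((pred - rewards) ** 2).mean()

    # Conservative term: predicted reward of random (likely out-of-distribution)
    # actions, sampled uniformly from [-1, 1]^action_dim (an assumption).
    random_actions = torch.empty(states.shape[0], action_dim).uniform_(-1.0, 1.0)
    random_reward = reward_model(states, random_actions).mean()

    # Minimizing both terms keeps the model accurate on the data while
    # biasing rewards for out-of-distribution actions downward.
    return estimation_error + beta * random_reward
```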
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Behavior Proximal Policy Optimization [14.701955559885615]
Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly.
Online on-policy algorithms are naturally able to solve offline RL.
We propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization.
arXiv Detail & Related papers (2023-02-22T11:49:12Z)
- Offline RL Policies Should be Trained to be Adaptive [89.8580376798065]
We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP.
As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but rather all the transitions seen so far during evaluation.
We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.
arXiv Detail & Related papers (2022-07-05T17:58:33Z)
- RORL: Robust Offline Reinforcement Learning via Conservative Smoothing [72.8062448549897]
Offline reinforcement learning can exploit the massive amount of offline data for complex decision-making tasks.
Current offline RL algorithms are generally designed to be conservative for value estimation and action selection.
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
arXiv Detail & Related papers (2022-06-06T18:07:41Z)
- Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning [125.8224674893018]
Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment.
Applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions.
We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints.
arXiv Detail & Related papers (2022-02-23T15:27:16Z)
- Offline Reinforcement Learning with Reverse Model-based Imagination [25.376888160137973]
In offline reinforcement learning (offline RL), one of the main challenges is to deal with the distributional shift between the learning policy and the given dataset.
Recent offline RL methods attempt to introduce conservatism bias to encourage learning on high-confidence areas.
We propose a novel model-based offline RL framework, called Reverse Offline Model-based Imagination (ROMI).
arXiv Detail & Related papers (2021-10-01T03:13:22Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics (see the sketch after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
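For the MOPO entry above, the uncertainty-penalized reward can be sketched as follows, assuming an ensemble of learned dynamics models whose prediction disagreement serves as the uncertainty estimate. The ensemble-std heuristic, the penalty weight `lam`, and the function names are assumptions for illustration, not the paper's exact penalty.

```python
import torch

def penalized_reward(model_reward, next_state_preds, lam=1.0):
    """Sketch of an uncertainty-penalized reward in the spirit of MOPO:
    subtract a penalty proportional to the disagreement of a dynamics-model
    ensemble. The per-dimension-std heuristic and `lam` are assumptions.
    next_state_preds: tensor of shape (ensemble_size, batch, state_dim)."""
    # Disagreement across the ensemble as a crude uncertainty proxy.
    uncertainty = next_state_preds.std(dim=0).max(dim=-1).values  # (batch,)
    return model_reward - lam * uncertainty
```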