Conservative Dual Policy Optimization for Efficient Model-Based
Reinforcement Learning
- URL: http://arxiv.org/abs/2209.07676v1
- Date: Fri, 16 Sep 2022 02:27:01 GMT
- Title: Conservative Dual Policy Optimization for Efficient Model-Based
Reinforcement Learning
- Authors: Shenao Zhang
- Abstract summary: We propose Conservative Dual Policy Optimization (CDPO) that involves a Referential Update and a Conservative Update.
A conservative range of randomness is guaranteed by maximizing the expectation of model value.
More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Provably efficient Model-Based Reinforcement Learning (MBRL) based on
optimism or posterior sampling (PSRL) is guaranteed to attain global optimality
asymptotically by introducing a complexity measure of the model. However, this
complexity can grow exponentially even for the simplest nonlinear models, in which
case global convergence is impossible within finitely many iterations. When the
model suffers a large generalization error, quantitatively measured by the model
complexity, the uncertainty can be large. The sampled model on which the current
policy is greedily optimized will thus be unsettled, resulting in aggressive policy
updates and over-exploration. In this work, we
propose Conservative Dual Policy Optimization (CDPO) that involves a
Referential Update and a Conservative Update. The policy is first optimized
under a reference model, which imitates the mechanism of PSRL while offering
more stability. A conservative range of randomness is guaranteed by maximizing
the expectation of model value. Without harmful sampling procedures, CDPO can
still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic
policy improvement and global optimality simultaneously. Empirical results also
validate the exploration efficiency of CDPO.
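To make the dual-update structure concrete, the following is a minimal, self-contained toy sketch in Python of the idea described in the abstract: a Referential Update that greedily optimizes the policy under a single stable reference model (here, the posterior mean), followed by a Conservative Update that keeps whichever candidate policy has the larger value in expectation over the model posterior. The tabular MDP, the Dirichlet posterior over transitions, and the two-candidate conservative step are illustrative assumptions for this sketch, not the authors' implementation.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, GAMMA = 5, 2, 20, 0.9          # states, actions, rollout horizon, discount

# Ground-truth MDP (unknown to the agent): random transitions, known rewards.
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # P_true[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))            # reward table, assumed known for simplicity

def value_of_policy(policy, P):
    """Exact value of a deterministic policy under transition model P."""
    P_pi = P[np.arange(S), policy]                # (S, S) transition matrix under the policy
    r_pi = R[np.arange(S), policy]                # (S,) rewards under the policy
    return np.linalg.solve(np.eye(S) - GAMMA * P_pi, r_pi)

def greedy_policy(P):
    """Value iteration under model P, returning the greedy deterministic policy."""
    V = np.zeros(S)
    for _ in range(200):
        V = (R + GAMMA * P @ V).max(axis=1)
    return (R + GAMMA * P @ V).argmax(axis=1)

counts = np.ones((S, A, S))                       # Dirichlet posterior over transitions
policy = rng.integers(A, size=S)                  # initial (arbitrary) policy

for _ in range(50):
    # Referential Update: greedily optimize under a stable reference model
    # (the posterior mean), imitating the PSRL mechanism without a noisy draw.
    P_ref = counts / counts.sum(axis=-1, keepdims=True)
    ref_policy = greedy_policy(P_ref)

    # Conservative Update: keep whichever candidate policy has the larger
    # value in expectation over the model posterior (Monte Carlo estimate).
    samples = [np.array([[rng.dirichlet(counts[s, a]) for a in range(A)] for s in range(S)])
               for _ in range(10)]
    def expected_value(pi):
        return np.mean([value_of_policy(pi, P).mean() for P in samples])
    policy = max([policy, ref_policy], key=expected_value)

    # Interact with the true MDP and update the posterior counts.
    s = rng.integers(S)
    for _ in range(H):
        a = policy[s]
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next

print("learned policy:", policy)
```
In this sketch the conservative step averages values over posterior samples, so a single optimistic model draw cannot by itself trigger an aggressive policy change; this is the stabilizing effect the abstract attributes to maximizing the expectation of the model value.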
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to that of PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee in model-based RL (MBRL).
The derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Online Policy Optimization for Robust MDP [17.995448897675068]
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go.
In this work, we consider online robust Markov decision processes (MDPs) in which the agent interacts with an unknown nominal system.
We propose a robust optimistic policy optimization algorithm that is provably efficient.
arXiv Detail & Related papers (2022-09-28T05:18:20Z) - Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling
under Partial Coverage [33.766012922307084]
We study model-based offline Reinforcement Learning with general function approximation.
We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint to encode pessimism.
arXiv Detail & Related papers (2021-07-13T16:30:01Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints.
This is the first analysis of SRL algorithms with a guarantee of convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Mixed Reinforcement Learning with Additive Stochastic Uncertainty [19.229447330293546]
Reinforcement learning (RL) methods often rely on massive exploration data to search for optimal policies and suffer from poor sampling efficiency.
This paper presents a mixed RL algorithm that simultaneously uses dual representations of the environmental dynamics to search for the optimal policy.
The effectiveness of the mixed RL algorithm is demonstrated on a typical optimal control problem for non-affine nonlinear systems.
arXiv Detail & Related papers (2020-02-28T08:02:34Z)