Mixed Reinforcement Learning with Additive Stochastic Uncertainty
- URL: http://arxiv.org/abs/2003.00848v1
- Date: Fri, 28 Feb 2020 08:02:34 GMT
- Title: Mixed Reinforcement Learning with Additive Stochastic Uncertainty
- Authors: Yao Mu, Shengbo Eben Li, Chang Liu, Qi Sun, Bingbing Nie, Bo Cheng,
and Baiyu Peng
- Abstract summary: Reinforcement learning (RL) methods often rely on massive exploration data to search for optimal policies and suffer from poor sampling efficiency.
This paper presents a mixed RL algorithm that simultaneously uses dual representations of the environmental dynamics to search for the optimal policy.
The effectiveness of the mixed RL is demonstrated on a typical optimal control problem for non-affine nonlinear systems.
- Score: 19.229447330293546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) methods often rely on massive exploration data to
search for optimal policies and suffer from poor sampling efficiency. This paper
presents a mixed reinforcement learning (mixed RL) algorithm that simultaneously
uses dual representations of the environmental dynamics to search for the optimal
policy, with the aim of improving both learning accuracy and training speed.
The dual representations are the environmental model and the state-action
data: the former can accelerate the learning process of RL, while its inherent
model uncertainty generally leads to worse policy accuracy than the latter,
which comes from direct measurements of states and actions. In the mixed RL
framework, compensation of the additive stochastic model uncertainty is embedded
inside the policy-iteration loop by using explored state-action data via an
iterative Bayesian estimator (IBE). The optimal policy is then computed
iteratively by alternating between policy evaluation (PEV) and policy improvement
(PIM). The convergence of the mixed RL is proved using Bellman's principle of
optimality, and the recursive stability of the generated policy is proved via
Lyapunov's direct method. The effectiveness of the mixed RL is demonstrated on a
typical optimal control problem for stochastic non-affine nonlinear systems (i.e.,
a double-lane-change task with an automated vehicle).
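As a rough illustration of the framework described above, the sketch below discretizes a hypothetical scalar non-affine nonlinear system, estimates the mean of the additive disturbance from explored state-action data with a conjugate Gaussian update (standing in for the iterative Bayesian estimator), and alternates policy evaluation (PEV) and policy improvement (PIM) on the compensated nominal model. All system parameters, function names, and grid sizes here are assumptions made for illustration; this is a minimal sketch, not the authors' implementation.

```python
# Minimal, illustrative sketch of the mixed RL idea: a nominal model is compensated
# by an additive disturbance whose mean is estimated from explored state-action data
# via a conjugate Gaussian update, and the policy is obtained by alternating PEV and
# PIM on a discretized state-action grid. All names and parameters are hypothetical.
import numpy as np

def f_nominal(x, u):
    # Nominal (possibly biased) model of a scalar non-affine nonlinear system.
    return 0.9 * x + 0.5 * np.tanh(u)

def f_true(x, u, rng):
    # "Real" system: nominal dynamics plus an unknown additive bias and noise.
    return 0.9 * x + 0.5 * np.tanh(u) + 0.1 + 0.02 * rng.standard_normal()

states = np.linspace(-2.0, 2.0, 41)    # discretized state grid
actions = np.linspace(-1.0, 1.0, 21)   # discretized action grid
gamma = 0.95                           # discount factor

def cost(x, u):
    return x**2 + 0.1 * u**2           # quadratic stage cost

def nearest(x):
    return int(np.argmin(np.abs(states - x)))   # project a state onto the grid

# Stand-in for the iterative Bayesian estimator (IBE): sequential conjugate Gaussian
# update of the additive-disturbance mean from observed model residuals.
mu, prec = 0.0, 1.0                    # prior mean and precision
obs_prec = 1.0 / 0.02**2               # assumed (known) observation precision
rng = np.random.default_rng(0)
for _ in range(200):                   # explored state-action data
    x, u = rng.uniform(-2, 2), rng.uniform(-1, 1)
    residual = f_true(x, u, rng) - f_nominal(x, u)
    prec += obs_prec
    mu += obs_prec * (residual - mu) / prec   # posterior mean of the disturbance

# Policy iteration on the compensated model x' = f_nominal(x, u) + mu.
policy = np.zeros(len(states))         # initial policy: u = 0 everywhere
V = np.zeros(len(states))
for _ in range(50):
    # PEV: evaluate the current policy by fixed-point iteration.
    for _ in range(100):
        V = np.array([cost(x, policy[i]) + gamma * V[nearest(f_nominal(x, policy[i]) + mu)]
                      for i, x in enumerate(states)])
    # PIM: greedy improvement with respect to the evaluated value function.
    policy = np.array([actions[np.argmin([cost(x, u) + gamma * V[nearest(f_nominal(x, u) + mu)]
                                          for u in actions])] for x in states])

print("estimated disturbance mean:", round(mu, 3))
print("policy at x = 1.0:", policy[nearest(1.0)])
```

The point of the sketch is only that the model-based part (f_nominal) drives the policy iteration, while the data-driven part corrects the additive uncertainty term rather than replacing the model.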
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Adaptive Primal-Dual Method for Safe Reinforcement Learning [9.5147410074115]
We propose, analyze, and evaluate adaptive primal-dual (APD) methods for Safe Reinforcement Learning (SRL).
Two adaptive learning rates (LRs) are adjusted according to the Lagrangian multipliers so as to optimize the policy in each iteration.
Experiments show that the practical APD algorithm outperforms, or achieves performance comparable to, the constant-LR baselines and attains more stable training.
arXiv Detail & Related papers (2024-02-01T05:53:44Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief [3.0036519884678894]
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model.
In this work, we maintain a belief distribution over dynamics, and evaluate/optimize policy through biased sampling from the belief.
We show that the biased sampling naturally induces an updated dynamics belief with policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
arXiv Detail & Related papers (2022-10-13T03:14:36Z)
- Stochastic optimal well control in subsurface reservoirs using reinforcement learning [0.0]
We present a case study of a model-free reinforcement learning framework for solving optimal control under a predefined parameter uncertainty distribution.
In principle, RL algorithms are capable of learning optimal action policies to maximize a numerical reward signal.
We present numerical results using two state-of-the-art RL algorithms, proximal policy optimization (PPO) and advantage actor-critic (A2C) on two subsurface flow test cases.
arXiv Detail & Related papers (2022-07-07T17:34:23Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by training them on rewards artificially penalized by the uncertainty of the dynamics, as sketched after this list.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
- A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)
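For the MOPO entry above, the idea of rewards penalized by dynamics uncertainty can be illustrated with a short sketch. This is an assumption-laden illustration rather than MOPO's actual implementation: it uses the disagreement of a hypothetical dynamics ensemble as the uncertainty estimate, and the penalty coefficient `lam` is chosen arbitrarily.

```python
# Minimal sketch of an uncertainty-penalized reward in the spirit of the MOPO entry
# above: the reward predicted by a learned model is reduced in proportion to the
# disagreement of a dynamics ensemble. Names and the penalty form are illustrative.
import numpy as np

def penalized_reward(state, action, reward_model, dynamics_ensemble, lam=1.0):
    """Return r_hat(s, a) - lam * u(s, a), where u is an ensemble-based uncertainty."""
    r_hat = reward_model(state, action)
    # Predict the next state with each member of the ensemble.
    preds = np.stack([model(state, action) for model in dynamics_ensemble])
    # Use the largest per-dimension standard deviation as a simple uncertainty proxy.
    uncertainty = preds.std(axis=0).max()
    return r_hat - lam * uncertainty

# Toy usage with hypothetical linear models standing in for learned networks.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)) * 0.1 + np.eye(3) for _ in range(5)]
ensemble = [lambda s, a, W=W: W @ s + 0.1 * a for W in weights]
reward_model = lambda s, a: -float(s @ s) - 0.01 * float(a @ a)

s, a = rng.normal(size=3), rng.normal(size=3)
print(penalized_reward(s, a, reward_model, ensemble, lam=0.5))
```

Any reasonable uncertainty quantifier could replace the ensemble disagreement here; the point is only that the agent is discouraged from exploiting regions where the learned dynamics are unreliable.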
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.