Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach
- URL: http://arxiv.org/abs/2012.10791v1
- Date: Sat, 19 Dec 2020 21:51:23 GMT
- Title: Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach
- Authors: James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras
- Abstract summary: We propose a deep policy optimization approach designed to produce stable performance even when data is scarce.
The resulting algorithm, Uncertainty-Aware Trust Region Policy Optimization, generates robust policy updates that adapt to the level of uncertainty present throughout the learning process.
- Score: 8.325359814939517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order for reinforcement learning techniques to be useful in real-world
decision making processes, they must be able to produce robust performance from
limited data. Deep policy optimization methods have achieved impressive results
on complex tasks, but their real-world adoption remains limited because they
often require significant amounts of data to succeed. When combined with small
sample sizes, these methods can result in unstable learning due to their
reliance on high-dimensional sample-based estimates. In this work, we develop
techniques to control the uncertainty introduced by these estimates. We
leverage these techniques to propose a deep policy optimization approach
designed to produce stable performance even when data is scarce. The resulting
algorithm, Uncertainty-Aware Trust Region Policy Optimization, generates robust
policy updates that adapt to the level of uncertainty present throughout the
learning process.
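The abstract states the mechanism only at a high level: shrink policy updates when sample-based estimates are uncertain. The following is a minimal, hypothetical sketch of that idea, scaling a trust-region radius by a bootstrap estimate of gradient noise. The `adaptive_radius` heuristic and all names here are illustrative assumptions, not UA-TRPO's actual update rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(samples, n_boot=200):
    """Bootstrap standard error of the mean of 1-D per-trajectory samples."""
    n = len(samples)
    means = np.array([samples[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return means.std()

def adaptive_radius(base_radius, grad_samples):
    """Shrink a trust-region radius when the gradient estimate is uncertain.

    grad_samples: (n_trajectories, dim) per-trajectory contributions to the
    policy gradient (hypothetical inputs).
    """
    se = np.array([bootstrap_se(grad_samples[:, j]) for j in range(grad_samples.shape[1])])
    noise = np.linalg.norm(se)                          # aggregate estimation noise
    signal = np.linalg.norm(grad_samples.mean(axis=0))  # gradient magnitude
    # The radius shrinks toward zero as noise dominates signal.
    return base_radius * signal / (signal + noise + 1e-8)

# With few trajectories the estimate is noisy, so the step should be smaller.
few = rng.normal(1.0, 2.0, size=(5, 3))
many = rng.normal(1.0, 2.0, size=(500, 3))
print("radius,   5 trajectories:", adaptive_radius(0.01, few))
print("radius, 500 trajectories:", adaptive_radius(0.01, many))
```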
Related papers
- SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement [33.60500554561509]
To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data.
To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce.
In this paper, we propose safe noisy policy learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges.
arXiv Detail & Related papers (2025-03-17T02:53:53Z)
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL), which seeks to learn a good policy from a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper level) and the value function (lower level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z)
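The summary above only names the bi-level structure. As a rough illustration, and not the paper's algorithm, this toy sketch alternates a lower-level value fit on a fixed offline dataset with an upper-level soft policy update; the dataset and MDP are synthetic.

```python
import numpy as np

# Hypothetical tabular setup: 3 states, 2 actions, fixed offline dataset of
# (s, a, r, s') transitions. Illustrates only the alternating lower-level
# (value fit) / upper-level (policy update) structure.
rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9
data = [(rng.integers(nS), rng.integers(nA), rng.normal(), rng.integers(nS))
        for _ in range(500)]

pi = np.full((nS, nA), 1.0 / nA)   # stochastic policy (upper level)
Q = np.zeros((nS, nA))             # value function (lower level)

for _ in range(50):
    # Lower level: fit Q to the fixed dataset under the current policy.
    for _ in range(20):
        Q_new, cnt = np.zeros_like(Q), np.zeros_like(Q)
        for s, a, r, s2 in data:
            Q_new[s, a] += r + gamma * pi[s2] @ Q[s2]
            cnt[s, a] += 1
        Q = np.where(cnt > 0, Q_new / np.maximum(cnt, 1), Q)
    # Upper level: soft policy improvement against the fitted Q.
    pi = np.exp(5.0 * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions:", Q.argmax(axis=1))
```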
- Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs).
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z)
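For intuition on (s,a)-rectangular robustness, a hedged toy follows: robust value iteration where, independently for each state-action pair, nature picks the worst of a finite set of candidate transition models. This illustrates the worst-case objective such methods optimize, not the gradient-based RPG algorithm itself.

```python
import numpy as np

# Robust value iteration with a finite, (s,a)-rectangular uncertainty set:
# nature chooses the worst of K candidate transition models per (s, a).
rng = np.random.default_rng(2)
nS, nA, K, gamma = 4, 2, 3, 0.9
R = rng.normal(size=(nS, nA))
# P[k, s, a] is a distribution over next states under candidate model k.
P = rng.dirichlet(np.ones(nS), size=(K, nS, nA))

V = np.zeros(nS)
for _ in range(200):
    # Worst-case expected next value over the K models, per (s, a).
    worst = np.min(np.einsum('ksan,n->ksa', P, V), axis=0)
    V = np.max(R + gamma * worst, axis=1)

print("robust values:", np.round(V, 3))
```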
- Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z)
- Offline Policy Evaluation and Optimization under Confounding [35.778917456294046]
We map out the landscape of offline policy evaluation for confounded MDPs.
We characterize settings where consistent value estimates are provably not achievable.
We present new algorithms for offline policy improvement and prove local convergence guarantees.
arXiv Detail & Related papers (2022-11-29T20:45:08Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is often cost-ineffective or impractical.
We propose a new algorithmic learning framework, Model-based Uncertainty regularized and Sample Efficient Batch Optimization (MUSBO).
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
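As a generic illustration of model-based uncertainty regularization (one plausible reading of the summary above, not necessarily MUSBO's exact regularizer): penalize predicted rewards by the disagreement of an ensemble of learned models.

```python
import numpy as np

rng = np.random.default_rng(3)

def penalized_reward(ensemble_preds, lam=1.0):
    """ensemble_preds: (n_models, batch) predicted rewards for a batch of
    (s, a) pairs. Returns the mean prediction minus lam * ensemble std."""
    return ensemble_preds.mean(axis=0) - lam * ensemble_preds.std(axis=0)

# 5 models, 3 samples; the third sample has high model disagreement.
preds = rng.normal(loc=1.0, scale=[0.1, 0.1, 2.0], size=(5, 3))
print("penalized rewards:", np.round(penalized_reward(preds), 3))
# High-disagreement samples end up with lower usable reward.
```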
- Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem.
We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
arXiv Detail & Related papers (2021-02-18T05:11:41Z)
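A minimal behavior-cloning sketch of the setup just described: distill an expensive safe controller into a cheap policy. The linear expert gain and policy class are assumptions for illustration; the paper's safety guarantee under closed-loop distribution shift is not reproduced here.

```python
import numpy as np

# Distill a stand-in "safe" expert u = -K x into a cheap linear policy by
# least squares on states the expert visits.
rng = np.random.default_rng(4)
K_expert = np.array([[1.2, 0.4]])        # hypothetical expert gain
states = rng.normal(size=(1000, 2))      # states visited by the expert
actions = states @ -K_expert.T           # expert actions (stand-in for MPC)

# Fit policy parameters W so that states @ W ~= actions.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)
print("recovered gain:", np.round(-W.T, 3))  # should match K_expert
```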
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
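For context on the entry above, here is a standard weighted importance sampling estimator on logged trajectories; the paper's robust and optimistic framework builds on estimators of this kind but is not shown here. The toy environment is invented.

```python
import numpy as np

# Weighted importance sampling (WIS): re-weight logged returns by the
# likelihood ratio between the target and behavior policies.
rng = np.random.default_rng(5)
nA, H = 2, 5
behavior = np.array([0.5, 0.5])   # logging policy (uniform)
target = np.array([0.8, 0.2])     # policy we want to evaluate

def logged_trajectory():
    acts = rng.choice(nA, size=H, p=behavior)
    rews = (acts == 0).astype(float)   # toy dynamics: action 0 pays reward 1
    return acts, rews

weights, returns = [], []
for _ in range(5000):
    acts, rews = logged_trajectory()
    weights.append(np.prod(target[acts] / behavior[acts]))  # trajectory IS weight
    returns.append(rews.sum())
weights, returns = np.array(weights), np.array(returns)

print("WIS estimate:", (weights * returns).sum() / weights.sum())
print("true target value:", H * target[0])   # 4.0 for this toy problem
```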
- Chance Constrained Policy Optimization for Process Control and Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
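A hedged sketch of empirical chance-constraint handling: estimate the probability that a safety constraint holds under sampled disturbances, and penalize any shortfall below the required level 1 - alpha. The toy plant and penalty form are assumptions, not the paper's algorithm, which provides probabilistic guarantees rather than a soft penalty.

```python
import numpy as np

rng = np.random.default_rng(6)

def chance_penalty(setpoint, alpha=0.05, weight=100.0, n_mc=10_000):
    # Toy plant: temperature = setpoint + Gaussian disturbance;
    # safety constraint: temperature <= 100.
    temps = setpoint + rng.normal(0.0, 3.0, size=n_mc)
    p_satisfy = np.mean(temps <= 100.0)
    return weight * max(0.0, (1 - alpha) - p_satisfy), p_satisfy

for sp in (92.0, 96.0, 99.0):
    pen, p = chance_penalty(sp)
    print(f"setpoint {sp}: P(safe)={p:.3f}, penalty={pen:.2f}")
```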
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a policy that behaves smoothly with respect to its state inputs.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR^2L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
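To make "smoothness-inducing regularization" concrete, here is a sketch that penalizes how much a toy policy's action distribution changes under small state perturbations, using random directions as a stand-in for SR^2L's adversarial inner maximization; the toy policy and penalty form are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(2, 4))  # toy linear policy logits: 4-dim states, 2 actions

def policy_probs(states):
    logits = states @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def smoothness_penalty(states, eps=0.1, n_dirs=16):
    """Largest change in action probabilities over random eps-perturbations."""
    base = policy_probs(states)
    worst = np.zeros(len(states))
    for _ in range(n_dirs):  # random directions approximate the worst case
        delta = rng.normal(size=states.shape)
        delta *= eps / np.linalg.norm(delta, axis=1, keepdims=True)
        gap = np.abs(policy_probs(states + delta) - base).sum(axis=1)
        worst = np.maximum(worst, gap)
    return worst.mean()

states = rng.normal(size=(32, 4))
print("smoothness penalty:", round(smoothness_penalty(states), 4))
```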
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can improve both stability and final performance on benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
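As a point of reference for the proximity term in the entry above, here is a sketch of the simpler per-state action-KL penalty used by TRPO/PPO-style methods. The paper's contribution is to constrain discounted state-action visitation distributions instead, which this sketch does not implement; all quantities are synthetic.

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """KL divergence between rows of categorical distributions p and q."""
    p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
    return np.sum(p * np.log(p / q), axis=-1)

def regularized_objective(advantages, ratio, kl, beta=1.0):
    """Surrogate objective: importance-weighted advantage minus a KL
    proximity term keeping the new policy close to the old one."""
    return np.mean(ratio * advantages) - beta * np.mean(kl)

rng = np.random.default_rng(8)
old = rng.dirichlet(np.ones(3), size=64)             # old policy over 3 actions
new = 0.9 * old + 0.1 * rng.dirichlet(np.ones(3), size=64)
adv = rng.normal(size=64)
acts = np.array([rng.choice(3, p=p) for p in old])   # actions logged under old
ratio = new[np.arange(64), acts] / old[np.arange(64), acts]
print("objective:", round(regularized_objective(adv, ratio, kl_categorical(old, new)), 4))
```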
This list is automatically generated from the titles and abstracts of the papers on this site.