RAPTOR: End-to-end Risk-Aware MDP Planning and Policy Learning by
Backpropagation
- URL: http://arxiv.org/abs/2106.07260v1
- Date: Mon, 14 Jun 2021 09:27:19 GMT
- Title: RAPTOR: End-to-end Risk-Aware MDP Planning and Policy Learning by
Backpropagation
- Authors: Noah Patton, Jihwan Jeong, Michael Gimelfarb, Scott Sanner
- Abstract summary: We introduce Risk-Aware Planning using PyTorch (RAPTOR), a novel framework for risk-sensitive planning through end-to-end optimization of the entropic utility objective.
We evaluate and compare these two forms of RAPTOR on three highly stochastic domains, including nonlinear navigation, HVAC control, and linear reservoir control.
- Score: 12.600828753197204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Planning provides a framework for optimizing sequential decisions in complex
environments. Recent advances in efficient planning in deterministic or
stochastic high-dimensional domains with continuous action spaces leverage
backpropagation through a model of the environment to directly optimize
actions. However, existing methods typically do not take risk into account when
optimizing in stochastic domains, even though risk can be incorporated efficiently in MDPs
by optimizing the entropic utility of returns. We bridge this gap by
introducing Risk-Aware Planning using PyTorch (RAPTOR), a novel framework for
risk-sensitive planning through end-to-end optimization of the entropic utility
objective. A key technical difficulty of our approach is that direct
optimization of the entropic utility by backpropagation is impossible due to
the presence of environment stochasticity. The novelty of RAPTOR lies in the
reparameterization of the state distribution, which makes it possible to apply
stochastic backpropagation through sufficient statistics of the entropic
utility computed from forward-sampled trajectories. The direct optimization of
this empirical objective in an end-to-end manner is called the risk-averse
straight-line plan, which commits to a sequence of actions in advance and can
be sub-optimal in highly stochastic domains. We address this shortcoming by
optimizing for risk-aware Deep Reactive Policies (RaDRP) in our framework. We
evaluate and compare these two forms of RAPTOR on three highly stochastic
domains, including nonlinear navigation, HVAC control, and linear reservoir
control, demonstrating the ability to manage risk in complex MDPs.
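To make the approach concrete, below is a minimal illustrative sketch (not the authors' released code) of risk-averse straight-line planning by backpropagation. It assumes a hypothetical differentiable, reparameterizable transition model `step(state, action, noise)` and reward function `reward(state, action)`; all names, shapes, and hyperparameters are placeholders.

```python
# Sketch only: optimizing the empirical entropic utility of forward-sampled
# returns with respect to a fixed action sequence, via backpropagation.
import math
import torch


def entropic_utility(returns: torch.Tensor, beta: float) -> torch.Tensor:
    """Empirical entropic utility U_beta(R) = (1/beta) * log E[exp(beta * R)],
    estimated from forward-sampled returns; beta < 0 encodes risk aversion."""
    n = returns.numel()
    return (torch.logsumexp(beta * returns, dim=0) - math.log(n)) / beta


def straight_line_plan(step, reward, s0, horizon=20, n_samples=64,
                       beta=-1.0, action_dim=2, iters=500, lr=0.01):
    # A straight-line plan commits to one action per time step in advance.
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        # Reparameterized rollouts: sample exogenous noise separately so that
        # gradients flow through the sampled trajectories to the actions.
        state = s0.expand(n_samples, -1)
        total_return = torch.zeros(n_samples)
        for t in range(horizon):
            a = actions[t].expand(n_samples, -1)
            total_return = total_return + reward(state, a)
            noise = torch.randn(n_samples, state.shape[-1])
            state = step(state, a, noise)
        loss = -entropic_utility(total_return, beta)  # maximize the utility
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```

Under the same assumptions, the risk-aware deep reactive policy variant would replace the fixed `actions` tensor with a state-conditioned network (e.g. `a = policy(state)`) and optimize the network parameters instead, so the plan can react to realized stochasticity rather than committing to actions in advance.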
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Risk-sensitive Markov Decision Process and Learning under General
Utility Functions [3.6260136172126667]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations.
We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative rewards.
In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z) - Learning Regions of Interest for Bayesian Optimization with Adaptive
Level-Set Estimation [84.0621253654014]
We propose a framework, called BALLET, which adaptively filters for a high-confidence region of interest.
We show theoretically that BALLET can efficiently shrink the search space, and can exhibit a tighter regret bound than standard BO.
arXiv Detail & Related papers (2023-07-25T09:45:47Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Hierarchical Policy Blending as Inference for Reactive Robot Control [21.058662668187875]
Motion generation in cluttered, dense, and dynamic environments is a central topic in robotics.
We propose a hierarchical motion generation method that combines the benefits of reactive policies and planning.
Our experimental study in planar navigation and 6DoF manipulation shows that our proposed hierarchical motion generation method outperforms both myopic reactive controllers and online re-planning methods.
arXiv Detail & Related papers (2022-10-14T15:16:54Z) - Risk-Averse Decision Making Under Uncertainty [18.467950783426947]
A large class of decision-making-under-uncertainty problems can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs).
In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures.
arXiv Detail & Related papers (2021-09-09T07:52:35Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same $O(1/\epsilon^4)$ complexity, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z) - Improving Offline Contextual Bandits with Distributional Robustness [10.310819665706294]
We introduce a convex reformulation of the Counterfactual Risk Minimization principle.
Our approach is compatible with convex programs, and can therefore be readily adapted to the large data regime.
We present preliminary empirical results supporting the effectiveness of our approach.
arXiv Detail & Related papers (2020-11-13T09:52:16Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z) - Chance Constrained Policy Optimization for Process Control and
Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.