RAPTOR: End-to-end Risk-Aware MDP Planning and Policy Learning by
Backpropagation
- URL: http://arxiv.org/abs/2106.07260v1
- Date: Mon, 14 Jun 2021 09:27:19 GMT
- Title: RAPTOR: End-to-end Risk-Aware MDP Planning and Policy Learning by
Backpropagation
- Authors: Noah Patton, Jihwan Jeong, Michael Gimelfarb, Scott Sanner
- Abstract summary: We introduce Risk-Aware Planning using PyTorch (RAPTOR), a novel framework for risk-sensitive planning through end-to-end optimization of the entropic utility objective.
We evaluate and compare these two forms of RAPTOR on three highly stochastic domains, including nonlinear navigation, HVAC control, and linear reservoir control.
- Score: 12.600828753197204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Planning provides a framework for optimizing sequential decisions in complex
environments. Recent advances in efficient planning in deterministic or
stochastic high-dimensional domains with continuous action spaces leverage
backpropagation through a model of the environment to directly optimize
actions. However, existing methods typically do not take risk into account when
optimizing in stochastic domains, even though risk can be incorporated efficiently in MDPs
by optimizing the entropic utility of returns. We bridge this gap by
introducing Risk-Aware Planning using PyTorch (RAPTOR), a novel framework for
risk-sensitive planning through end-to-end optimization of the entropic utility
objective. A key technical difficulty of our approach is that direct
optimization of the entropic utility by backpropagation is impossible due to
the presence of environment stochasticity. The novelty of RAPTOR lies in the
reparameterization of the state distribution, which makes it possible to apply
stochastic backpropagation through sufficient statistics of the entropic
utility computed from forward-sampled trajectories. The direct optimization of
this empirical objective in an end-to-end manner is called the risk-averse
straight-line plan, which commits to a sequence of actions in advance and can
be sub-optimal in highly stochastic domains. We address this shortcoming by
optimizing for risk-aware Deep Reactive Policies (RaDRP) in our framework. We
evaluate and compare these two forms of RAPTOR on three highly stochastic
domains, including nonlinear navigation, HVAC control, and linear reservoir
control, demonstrating the ability to manage risk in complex MDPs.
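To make the approach concrete, below is a minimal illustrative sketch (not the authors' released code) of risk-averse straight-line planning by backpropagation. It assumes a hypothetical differentiable, reparameterizable transition model `step(state, action, noise)` and reward function `reward(state, action)`; all names, shapes, and hyperparameters are placeholders.

```python
# Sketch only: optimizing the empirical entropic utility of forward-sampled
# returns with respect to a fixed action sequence, via backpropagation.
import math
import torch


def entropic_utility(returns: torch.Tensor, beta: float) -> torch.Tensor:
    """Empirical entropic utility U_beta(R) = (1/beta) * log E[exp(beta * R)],
    estimated from forward-sampled returns; beta < 0 encodes risk aversion."""
    n = returns.numel()
    return (torch.logsumexp(beta * returns, dim=0) - math.log(n)) / beta


def straight_line_plan(step, reward, s0, horizon=20, n_samples=64,
                       beta=-1.0, action_dim=2, iters=500, lr=0.01):
    # A straight-line plan commits to one action per time step in advance.
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        # Reparameterized rollouts: sample exogenous noise separately so that
        # gradients flow through the sampled trajectories to the actions.
        state = s0.expand(n_samples, -1)
        total_return = torch.zeros(n_samples)
        for t in range(horizon):
            a = actions[t].expand(n_samples, -1)
            total_return = total_return + reward(state, a)
            noise = torch.randn(n_samples, state.shape[-1])
            state = step(state, a, noise)
        loss = -entropic_utility(total_return, beta)  # maximize the utility
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```

Under the same assumptions, the risk-aware deep reactive policy variant would replace the fixed `actions` tensor with a state-conditioned network (e.g. `a = policy(state)`) and optimize the network parameters instead, so the plan can react to realized stochasticity rather than committing to actions in advance.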
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Risk-sensitive Markov Decision Process and Learning under General
Utility Functions [3.6260136172126667]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations.
We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative rewards.
In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z) - Learning Regions of Interest for Bayesian Optimization with Adaptive
Level-Set Estimation [84.0621253654014]
We propose a framework, called BALLET, which adaptively filters for a high-confidence region of interest.
We show theoretically that BALLET can efficiently shrink the search space, and can exhibit a tighter regret bound than standard BO.
arXiv Detail & Related papers (2023-07-25T09:45:47Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Hierarchical Policy Blending as Inference for Reactive Robot Control [21.058662668187875]
Motion generation in cluttered, dense, and dynamic environments is a central topic in robotics.
We propose a hierarchical motion generation method that combines the benefits of reactive policies and planning.
Our experimental study in planar navigation and 6DoF manipulation shows that our proposed hierarchical motion generation method outperforms both myopic reactive controllers and online re-planning methods.
arXiv Detail & Related papers (2022-10-14T15:16:54Z) - Risk-Averse Decision Making Under Uncertainty [18.467950783426947]
A large class of decision-making-under-uncertainty problems can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs).
In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures.
arXiv Detail & Related papers (2021-09-09T07:52:35Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same $O(1/\epsilon^4)$ complexity, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z) - Improving Offline Contextual Bandits with Distributional Robustness [10.310819665706294]
We introduce a convex reformulation of the Counterfactual Risk Minimization principle.
Our approach is compatible with convex programs, and can therefore be readily adapted to the large data regime.
We present preliminary empirical results supporting the effectiveness of our approach.
arXiv Detail & Related papers (2020-11-13T09:52:16Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z) - Chance Constrained Policy Optimization for Process Control and
Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.