Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness
to Model Misspecification
- URL: http://arxiv.org/abs/2211.03413v1
- Date: Mon, 7 Nov 2022 10:18:31 GMT
- Title: Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness
to Model Misspecification
- Authors: Takumi Tanabe, Rei Sato, Kazuto Fukuchi, Jun Sakuma, Youhei Akimoto
- Abstract summary: This study focuses on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values.
The aim is to optimize the worst-case performance on the uncertainty parameter set to guarantee the performance in the corresponding real-world environment.
Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method exhibited a worst-case performance superior to several baseline approaches.
- Score: 22.241676350331968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of reinforcement learning, because of the high cost and risk of
policy training in the real world, policies are trained in a simulation
environment and transferred to the corresponding real-world environment.
However, the simulation environment does not perfectly mimic the real-world
environment, leading to model misspecification. Multiple studies report
significant deterioration of policy performance in a real-world environment. In
this study, we focus on scenarios involving a simulation environment with
uncertainty parameters and the set of their possible values, called the
uncertainty parameter set. The aim is to optimize the worst-case performance on
the uncertainty parameter set to guarantee the performance in the corresponding
real-world environment. To obtain a policy for the optimization, we propose an
off-policy actor-critic approach called the Max-Min Twin Delayed Deep
Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min
optimization problem using a simultaneous gradient ascent descent approach.
Experiments in multi-joint dynamics with contact (MuJoCo) environments show
that the proposed method exhibited a worst-case performance superior to several
baseline approaches.
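As a rough illustration of the simultaneous gradient ascent descent idea mentioned in the abstract, the sketch below runs it on a toy max-min problem. It is not the M2TD3 algorithm: the scalar objective, the names `objective` and `simultaneous_gda`, the step size, and the interval used as the uncertainty parameter set are all assumptions made for this example. Here `theta` stands in for the policy parameter (maximized) and `omega` for the uncertainty parameter (minimized within its set).

```python
# Toy illustration only: this is NOT M2TD3. The objective, step size, and
# interval bounds below are hypothetical choices made for this sketch.


def objective(theta, omega):
    """Stand-in for performance J(theta, omega): concave in theta, convex in omega."""
    return -(theta - 1.0) ** 2 + theta * omega + omega ** 2


def grad_theta(theta, omega):
    """Partial derivative of the objective with respect to theta."""
    return -2.0 * (theta - 1.0) + omega


def grad_omega(theta, omega):
    """Partial derivative of the objective with respect to omega."""
    return theta + 2.0 * omega


def simultaneous_gda(theta=0.0, omega=0.0, lr=0.1, steps=500,
                     omega_bounds=(-1.0, 1.0)):
    """Approximately solve max over theta of min over omega (within omega_bounds)."""
    low, high = omega_bounds
    for _ in range(steps):
        # Both gradients are evaluated at the same current point
        # (simultaneous updates, not alternating ones).
        g_theta = grad_theta(theta, omega)
        g_omega = grad_omega(theta, omega)
        # Ascent step for the maximizing player.
        theta = theta + lr * g_theta
        # Descent step for the minimizing player, projected back onto the
        # uncertainty parameter set (here a simple interval).
        omega = min(max(omega - lr * g_omega, low), high)
    return theta, omega


theta_star, omega_star = simultaneous_gda()
print(theta_star, omega_star, objective(theta_star, omega_star))
```

For this particular objective the iterates approach the saddle point theta = 0.8, omega = -0.4, where the worst-case value over the interval is -0.2. Convergence here relies on the toy objective being strongly concave in theta and strongly convex in omega; simultaneous updates are not guaranteed to converge for general objectives.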
Related papers
- Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters.
We learn and analyse IMDPs for a set of unknown sample environments induced by parameters.
We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Grounding Aleatoric Uncertainty in Unsupervised Environment Design [32.00797965770773]
In partially-observable settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment.
We propose a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to curriculum-induced covariate shift (CICS).
arXiv Detail & Related papers (2022-07-11T22:45:29Z)
- Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model [42.28001762749647]
In high-stake scenarios like medical treatment and auto-piloting, it's risky or even infeasible to collect online experimental data to train the agent.
We consider policy learning for Robust Markov Decision Processes (RMDP), where the agent tries to seek a robust policy with respect to unexpected perturbations on the environments.
Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties.
arXiv Detail & Related papers (2022-03-13T06:37:25Z)
- Uncertainty Aware System Identification with Universal Policies [45.44896435487879]
Sim2real transfer is concerned with transferring policies trained in simulation to potentially noisy real-world environments.
We propose Uncertainty-aware policy search (UncAPS), where we use a Universal Policy Network (UPN) to store simulation-trained task-specific policies.
We then employ robust Bayesian optimisation to craft robust policies for the given environment by combining relevant UPN policies in a DR-like fashion.
arXiv Detail & Related papers (2022-02-11T18:27:23Z)
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Parameterized MDPs and Reinforcement Learning Problems -- A Maximum Entropy Principle Based Framework [2.741266294612776]
We present a framework to address a class of sequential decision making problems.
Our framework features learning the optimal control policy with robustness to noisy data.
arXiv Detail & Related papers (2020-06-17T04:08:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.