Robust Reinforcement Learning with Wasserstein Constraint
- URL: http://arxiv.org/abs/2006.00945v1
- Date: Mon, 1 Jun 2020 13:48:59 GMT
- Title: Robust Reinforcement Learning with Wasserstein Constraint
- Authors: Linfang Hou, Liang Pang, Xin Hong, Yanyan Lan, Zhiming Ma, Dawei Yin
- Abstract summary: We show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm.
The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.
- Score: 49.86490922809473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust Reinforcement Learning aims to find the optimal policy with a
certain degree of robustness to environmental dynamics. Existing learning
algorithms usually achieve robustness by disturbing the current state or by
simulating environmental parameters in a heuristic way, which lacks quantified
robustness to the system dynamics (i.e., the transition probability). To overcome
this issue, we leverage Wasserstein distance to measure the disturbance to the
reference transition kernel. With the Wasserstein distance, we are able to connect
the transition kernel disturbance to the state disturbance, i.e., reduce an
infinite-dimensional optimization problem to a finite-dimensional risk-aware
problem. Through the derived risk-aware optimal Bellman equation, we show the
existence of optimal robust policies, provide a sensitivity analysis for the
perturbations, and then design a novel robust learning algorithm, the Wasserstein
Robust Advantage Actor-Critic (WRAAC). The effectiveness of the
proposed algorithm is verified in the Cart-Pole environment.
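To make the constrained objective concrete, the following is a minimal sketch of a Wasserstein-constrained robust Bellman equation consistent with the abstract; the specific notation (reference kernel \bar{p}, ground metric d, budget \epsilon) is our own assumption, not the paper's exact statement.

```latex
% A sketch, not the paper's exact statement: the worst-case Bellman backup
% ranges over transition kernels q within a Wasserstein ball of radius
% \epsilon around the reference kernel \bar{p}.
v^{*}(s) = \max_{a \in A} \;
  \min_{q \,:\, W_{d}\left(q,\; \bar{p}(\cdot \mid s, a)\right) \le \epsilon}
  \left[ r(s, a) + \gamma \int_{S} v^{*}(s') \, q(\mathrm{d}s') \right]
```

Under this reading, Wasserstein duality lets the inner minimization over kernels be exchanged for a penalized search over perturbed next states, which is how the infinite-dimensional problem over kernels collapses to the finite-dimensional risk-aware form the abstract describes.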
Related papers
- Fast Value Tracking for Deep Reinforcement Learning [7.648784748888187]
Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment.
Existing algorithms often view these problems as static, focusing on point estimates for model parameters to maximize expected rewards.
Our research leverages the Kalman paradigm to introduce a novel quantification and sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD); a hedged sketch of the idea follows this entry.
arXiv Detail & Related papers (2024-03-19T22:18:19Z)
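The entry above describes tracking a distribution over value estimates rather than a point estimate. Here is a minimal sketch of that idea, assuming a plain SGLD-style TD(0) update with injected Gaussian noise on a toy random-walk chain; this is our illustration, not the LKTD algorithm itself.

```python
# Sketch: instead of a point-estimate TD update, inject Langevin (SGLD-style)
# noise so the parameter iterates form samples from a posterior over
# value-function weights. The linear features, step size, and toy chain are
# our own illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states, gamma, lr = 5, 0.9, 0.05
features = np.eye(n_states)          # one-hot features -> tabular values
w = np.zeros(n_states)               # value-function weights
samples = []                         # posterior samples of w

s = 0
for step in range(5000):
    s_next = (s + rng.integers(1, 3)) % n_states   # toy random-walk chain
    r = 1.0 if s_next == 0 else 0.0
    td_error = r + gamma * features[s_next] @ w - features[s] @ w
    grad = td_error * features[s]                  # semi-gradient of TD(0)
    noise = rng.normal(0.0, np.sqrt(2.0 * lr), size=w.shape)
    w = w + lr * grad + noise                      # Langevin-style update
    if step > 1000:                                # discard burn-in
        samples.append(w.copy())
    s = s_next

samples = np.array(samples)
print("posterior mean of V:", samples.mean(axis=0))
print("posterior std  of V:", samples.std(axis=0))
```

The spread of the retained samples is what provides the uncertainty quantification that a point-estimate TD update cannot.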
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified, while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite-time guarantees that is equipped to deal with high-dimensional state spaces; a hedged sketch of the single-loop idea follows this entry.
arXiv Detail & Related papers (2022-10-04T00:11:38Z)
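Below is a minimal sketch of the single-loop idea from the structural-estimation entry above: one soft Bellman update (inner) interleaved with one gradient step on the reward parameters (outer), rather than solving the inner problem to convergence each iteration. The toy MDP, soft value iteration, and likelihood objective are our own assumptions, not the paper's estimator.

```python
# Sketch of a single-loop nested estimator: the inner variable V and the
# outer variable theta are updated in lockstep.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transitions P[s, a, s']
phi = rng.normal(size=(nS, nA, 3))              # reward features
theta_true = np.array([1.0, -0.5, 0.3])

# generate demonstrations from a soft-optimal policy under theta_true
V_t = np.zeros(nS)
for _ in range(200):
    Q_t = phi @ theta_true + gamma * P @ V_t
    V_t = np.log(np.exp(Q_t).sum(axis=1))
pi_t = np.exp(Q_t - V_t[:, None])
states = rng.integers(0, nS, size=300)
data = [(s, rng.choice(nA, p=pi_t[s])) for s in states]

theta = np.zeros(3)       # reward parameters (outer variable)
V = np.zeros(nS)          # soft value function (inner variable)
lr = 0.1
for _ in range(1000):
    Q = phi @ theta + gamma * P @ V
    V = np.log(np.exp(Q).sum(axis=1))        # one soft Bellman update (inner)
    pi = np.exp(Q - V[:, None])              # softmax policy implied by Q
    # one outer step: ascend the demo log-likelihood, holding V fixed
    # (a biased but cheap gradient, typical of single-loop schemes)
    grad = np.zeros(3)
    for s, a in data:
        grad += phi[s, a] - pi[s] @ phi[s]
    theta += lr * grad / len(data)

print("true theta:     ", theta_true)
print("estimated theta:", theta)
```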
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LB-SGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy optimization tasks in safe reinforcement learning; a hedged sketch of the barrier step follows this entry.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
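The following is a minimal sketch of the log-barrier idea in the entry above: replace the constraint g(x) <= 0 with a penalty -eta * log(-g(x)) and cap the step size so the iterate never crosses the boundary. The quadratic objective, linear constraint, and specific step-size cap are our own illustrative assumptions, not the LB-SGD algorithm itself.

```python
# Sketch: gradient steps on a log-barrier surrogate, with the step size
# capped so the iterate stays strictly feasible.
import numpy as np

def f(x):            # objective: distance to an (infeasible) target
    return 0.5 * np.sum((x - np.array([2.0, 2.0])) ** 2)

def grad_f(x):
    return x - np.array([2.0, 2.0])

def g(x):            # safety constraint: x0 + x1 - 1 <= 0
    return x[0] + x[1] - 1.0

grad_g = np.array([1.0, 1.0])

x = np.array([0.0, 0.0])          # strictly feasible start
eta = 0.1                         # barrier weight
for t in range(500):
    # gradient of f(x) - eta * log(-g(x))
    barrier_grad = grad_f(x) + eta * grad_g / (-g(x))
    direction = -barrier_grad
    # cap the step: move at most half the distance to the boundary
    slope = grad_g @ direction
    max_step = 0.5 * (-g(x)) / slope if slope > 0 else 0.1
    step = min(0.1, max_step)
    x = x + step * direction

print("iterate:", x, "constraint value:", g(x))   # stays strictly feasible
```

The barrier term blows up as the iterate approaches the boundary, so feasibility is maintained throughout learning rather than enforced after the fact.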
- Continuous-Time Fitted Value Iteration for Robust Policies [93.25997466553929]
Solving the Hamilton-Jacobi-Bellman equation is important in many domains including control, robotics and economics.
We propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI).
These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems; a hedged sketch of the resulting closed-form action follows this entry.
arXiv Detail & Related papers (2021-10-05T11:33:37Z)
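To show why the structural assumptions in the entry above help, here is a sketch of the closed-form optimal action they permit; the dynamics and reward forms below are assumptions consistent with the entry, not quoted from the paper.

```latex
% Sketch under assumed control-affine dynamics \dot{x} = a(x) + B(x)\,u and
% separable reward r(x, u) = q_c(x) - g_c(u), with g_c strictly convex.
% Maximizing the HJB right-hand side over u then has a closed form via the
% convex conjugate g_c^{*}:
u^{*}(x) = \nabla g_c^{*}\!\left( B(x)^{\top} \nabla_x V^{*}(x) \right)
```

Because the maximizing action is available in closed form, value iteration can proceed without an inner policy-optimization loop.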
- Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm; a hedged tabular sketch of a worst-case backup follows this entry.
arXiv Detail & Related papers (2021-05-25T19:48:35Z)
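Here is a minimal tabular sketch of the worst-case backup behind robust value iteration as described above: at each state-action pair, back up against the least favorable transition kernel in a small uncertainty set. The toy MDP and the finite perturbation set are our own illustrative assumptions, not the paper's continuous-control method.

```python
# Sketch: tabular robust value iteration against a finite set of perturbed
# transition kernels around a nominal model.
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 2, 0.9
R = rng.uniform(0, 1, size=(nS, nA))

P0 = rng.dirichlet(np.ones(nS), size=(nS, nA))    # nominal kernel
kernels = [P0]
for _ in range(3):                                 # perturbed kernels
    P = 0.8 * P0 + 0.2 * rng.dirichlet(np.ones(nS), size=(nS, nA))
    kernels.append(P)

V = np.zeros(nS)
for _ in range(200):
    # worst case over the uncertainty set, then greedy over actions
    Q_worst = np.min([R + gamma * P @ V for P in kernels], axis=0)
    V = Q_worst.max(axis=1)

print("robust values:", V)
```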
- Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization [85.84019017587477]
Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications.
Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms; a hedged sketch of an extra-gradient step follows this entry.
arXiv Detail & Related papers (2021-04-27T16:56:09Z)
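Here is a minimal sketch of the extra-gradient method mentioned above, on a toy bilinear saddle-point problem rather than the paper's DRSL loss, which is our own illustrative assumption.

```python
# Sketch: extra-gradient on min_x max_y f(x, y) = x * y, whose saddle point
# is the origin. Plain simultaneous gradient descent-ascent circles this
# saddle; the extra-gradient "lookahead" step makes the iterates converge.
def grads(x, y):             # gradients of f(x, y) = x * y
    return y, x              # df/dx = y, df/dy = x

x, y, lr = 1.0, 1.0, 0.1
for t in range(200):
    gx, gy = grads(x, y)
    x_half = x - lr * gx     # lookahead: descend in x, ascend in y
    y_half = y + lr * gy
    gx, gy = grads(x_half, y_half)
    x = x - lr * gx          # actual update uses lookahead gradients
    y = y + lr * gy

print("iterate:", (x, y))    # close to the saddle point (0, 0)
```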
- Parameterized MDPs and Reinforcement Learning Problems -- A Maximum Entropy Principle Based Framework [2.741266294612776]
We present a framework to address a class of sequential decision-making problems.
Our framework features learning the optimal control policy with robustness to noisy data.
arXiv Detail & Related papers (2020-06-17T04:08:35Z)
- Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent [3.118384520557952]
We introduce DEPS, a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem.
In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation.
We show that DEPS performs at least as well as or better in all three environments, consistently yielding solutions with higher returns in fewer iterations; a hedged sketch of the joint update follows this entry.
arXiv Detail & Related papers (2020-06-02T16:08:07Z)
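The following is a minimal sketch of the joint design-and-control idea from the entry above: treat both an environment design parameter and the policy parameters as decision variables, estimate the gradient of the expected return by Monte-Carlo rollouts, and project the design variable back onto its feasible box. The toy differentiable environment is our own assumption, and finite differences stand in for automatic differentiation to keep the sketch dependency-free.

```python
# Sketch: projected stochastic gradient ascent over (environment design,
# policy weight) jointly, with Monte-Carlo return-gradient estimates.
import numpy as np

rng = np.random.default_rng(3)

def rollout_return(design, policy_w, noise):
    # toy dynamics: the return depends smoothly on both the environment
    # design parameter and the (linear) policy weight
    x, total = 1.0, 0.0
    for eps in noise:
        u = policy_w * x                      # linear state-feedback policy
        x = design * x + u + 0.1 * eps        # stochastic transition
        total += -(x ** 2) - 0.1 * (u ** 2)   # quadratic cost as reward
    return total

def mc_gradient(design, policy_w, n=32, h=1e-4):
    # Monte-Carlo gradient with common random numbers; finite differences
    # stand in for automatic differentiation
    g_d = g_w = 0.0
    for _ in range(n):
        noise = rng.normal(size=20)
        g_d += (rollout_return(design + h, policy_w, noise)
                - rollout_return(design - h, policy_w, noise)) / (2 * h)
        g_w += (rollout_return(design, policy_w + h, noise)
                - rollout_return(design, policy_w - h, noise)) / (2 * h)
    return g_d / n, g_w / n

design, policy_w, lr = 0.9, 0.0, 0.01
for t in range(300):
    g_d, g_w = mc_gradient(design, policy_w)
    design = np.clip(design + lr * g_d, 0.5, 1.5)   # projected ascent step
    policy_w += lr * g_w

print("design:", design, "policy weight:", policy_w)
```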
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.