Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization
- URL: http://arxiv.org/abs/2411.18612v1
- Date: Wed, 27 Nov 2024 18:57:03 GMT
- Title: Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization
- Authors: Cheng Tang, Zhishuai Liu, Pan Xu
- Abstract summary: We propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP).
For the offline RL setting, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI).
- Score: 10.465789490644031
- License:
- Abstract: The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set constraint with a regularization term on the value function, offering improved scalability and theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, often leading to overly conservative policies by considering transitions that are unrealistic. To address these issues, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline RL setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing these bounds depend on how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. This term is further shown to be fundamental to $d$-RRMDPs via information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.
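As a rough illustration of the recipe the abstract describes (linear function approximation, an $f$-divergence regularized robust Bellman backup, and a pessimism penalty tied to dataset coverage), below is a minimal Python sketch on a toy tabular problem. It instantiates the $f$-divergence as the KL divergence, for which the regularized robust backup has a closed form; all names (`kl_regularized_robust_backup`, `r2pvi_sketch`, the regularization weight `lam`, the pessimism coefficient `beta`) and the simplification of a known nominal kernel are assumptions made for illustration, not the authors' R2PVI implementation.

```python
import numpy as np

def kl_regularized_robust_backup(nominal_next_probs, V, lam):
    """Closed-form KL-regularized robust expectation of V under the nominal
    next-state distribution P0:
        inf_P { E_P[V] + lam * KL(P || P0) } = -lam * log E_{P0}[ exp(-V / lam) ].
    """
    return -lam * np.log(np.dot(nominal_next_probs, np.exp(-V / lam)))

def r2pvi_sketch(phi, dataset, nominal_P, rewards, H, lam=1.0, beta=0.1, ridge=1.0):
    """Illustrative pessimistic value iteration with linear features.

    phi:        (S, A, d) feature tensor; Q(s, a) is modeled as phi[s, a] @ w.
    dataset:    list of observed (s, a) index pairs from the offline data.
    nominal_P:  (S, A, S) nominal transition kernel (known here for simplicity).
    rewards:    (S, A) reward table, assumed bounded in [0, 1].
    """
    S, A, d = phi.shape
    # Empirical feature covariance; its inverse scales the pessimism bonus.
    Lambda = ridge * np.eye(d)
    for (s, a) in dataset:
        Lambda += np.outer(phi[s, a], phi[s, a])
    Lambda_inv = np.linalg.inv(Lambda)
    bonus = beta * np.sqrt(np.einsum("sad,de,sae->sa", phi, Lambda_inv, phi))

    X = np.array([phi[s, a] for (s, a) in dataset])  # (N, d) design matrix
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        # Regularized robust Bellman target at every (s, a) under the nominal kernel.
        target = np.array([[rewards[s, a] + kl_regularized_robust_backup(nominal_P[s, a], V, lam)
                            for a in range(A)] for s in range(S)])
        y = np.array([target[s, a] for (s, a) in dataset])
        w = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)  # ridge regression fit
        Q = np.clip(phi @ w - bonus, 0.0, H)  # subtract bonus: pessimism under poor coverage
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

Note that in the constrained DRMDP formulation the inner minimization over transition kernels would require solving a dual optimization oracle at every backup; in the regularized formulation sketched here it collapses to the closed-form log-sum-exp expression, which is the computational advantage the abstract points to.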
Related papers
- Policy Gradient for Robust Markov Decision Processes [16.281897051782863]
This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust Markov decision processes (MDPs).
DRPMD employs a general mirror descent update rule for the policy optimization with adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy (a generic sketch of a single mirror descent update appears after the related-papers list below).
We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror Ascent (TMA).
arXiv Detail & Related papers (2024-10-29T15:16:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality [55.06411438416805]
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications.
This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies.
It is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel [46.373217780462944]
We present EWoK, a novel online approach for solving RMDPs that Estimates the Worst transition Kernel to learn robust policies.
EWoK achieves robustness by simulating the worst scenarios for the agent while retaining complete flexibility in the learning process.
Our experiments, spanning from simple Cartpole to high-dimensional DeepMind Control Suite environments, demonstrate the effectiveness and applicability of EWoK.
arXiv Detail & Related papers (2023-06-09T12:45:41Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving robust Markov decision processes (MDPs).
A robust MDP involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z) - Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints [8.840221198764482]
We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints.
We propose a Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm that features three mechanisms: periodic-restart-based policy improvement, dual update with dual regularization, and periodic-restart-based optimistic policy evaluation.
arXiv Detail & Related papers (2022-01-28T07:18:29Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the periodically restarted optimistic policy optimization algorithm (PROPO).
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with external uncertainty in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)
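To make the mirror descent policy update referenced in the DRPMD and GPMD entries above concrete, here is the textbook softmax (KL-proximal) policy mirror descent step; this is a generic sketch in Python, not either paper's implementation, and the function name and step size `eta` are illustrative.

```python
import numpy as np

def softmax_pmd_update(policy, Q, eta):
    """One policy mirror descent step under softmax parameterization:
        pi_{t+1}(a|s)  proportional to  pi_t(a|s) * exp(eta * Q_t(s, a)),
    i.e. a KL-proximal (exponentiated-gradient) policy improvement step.
    policy: (S, A) array of action probabilities; Q: (S, A) action values.
    """
    logits = np.log(policy + 1e-12) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)
```

As `eta` grows large the step approaches greedy policy improvement; per the DRPMD entry above, the double-loop robust method pairs an outer update of this kind with an inner solve of the worst-case transition problem (via Transition Mirror Ascent).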