Distributionally Robust Off-Dynamics Reinforcement Learning: Provable
Efficiency with Linear Function Approximation
- URL: http://arxiv.org/abs/2402.15399v1
- Date: Fri, 23 Feb 2024 16:01:44 GMT
- Title: Distributionally Robust Off-Dynamics Reinforcement Learning: Provable
Efficiency with Linear Function Approximation
- Authors: Zhishuai Liu, Pan Xu
- Abstract summary: We study off-dynamics Reinforcement Learning (RL), where the policy is trained on a source domain and deployed to a distinct target domain.
We provide the first study on online DRMDPs with function approximation for off-dynamics RL.
We introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation.
- Score: 8.234072589087095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study off-dynamics Reinforcement Learning (RL), where the policy is
trained on a source domain and deployed to a distinct target domain. We aim to
solve this problem via online distributionally robust Markov decision processes
(DRMDPs), where the learning algorithm actively interacts with the source
domain while seeking the optimal performance under the worst possible dynamics
that is within an uncertainty set of the source domain's transition kernel. We
provide the first study on online DRMDPs with function approximation for
off-dynamics RL. We find that DRMDPs' dual formulation can induce nonlinearity,
even when the nominal transition kernel is linear, leading to error
propagation. By designing a $d$-rectangular uncertainty set using the total
variation distance, we remove this additional nonlinearity and bypass the error
propagation. We then introduce DR-LSVI-UCB, the first provably efficient online
DRMDP algorithm for off-dynamics RL with function approximation, and establish
a polynomial suboptimality bound that is independent of the state and action
space sizes. Our work takes the first step towards a deeper understanding of
the provable efficiency of online DRMDPs with linear function approximation.
Finally, we substantiate the performance and robustness of DR-LSVI-UCB through
different numerical experiments.
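To make the total-variation (TV) uncertainty set concrete, here is a minimal tabular sketch of the worst-case expectation over a TV ball around a nominal transition distribution, plugged into a robust Bellman backup. This is only an illustration of the robust backup idea, not the paper's DR-LSVI-UCB algorithm (which works with linear function approximation and a $d$-rectangular uncertainty set); the function names, the radius `rho`, and the toy MDP are all assumptions introduced for this sketch.

```python
import numpy as np

def tv_worst_case_expectation(p0, v, rho):
    """inf_{p : TV(p, p0) <= rho} E_p[v] over a finite next-state space.

    With TV(p, q) = 0.5 * ||p - q||_1, a radius of rho allows moving at most
    rho probability mass; the infimum is attained by draining mass from the
    highest-value next states onto the single lowest-value next state.
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    p = p0.copy()
    budget = rho
    worst = int(np.argmin(v))
    for s in np.argsort(-v):              # highest-value states first
        if budget <= 0 or v[s] <= v[worst]:
            break                          # no further decrease is possible
        moved = min(p[s], budget)
        p[s] -= moved
        p[worst] += moved
        budget -= moved
    return float(p @ v)

def robust_bellman_backup(P0, R, V, rho, gamma=0.95):
    """One robust backup: Q(s, a) = R(s, a) + gamma * inf_{P in TV-ball} E_P[V]."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            Q[s, a] = R[s, a] + gamma * tv_worst_case_expectation(P0[s, a], V, rho)
    return Q

# Toy usage: robust value iteration on a random 4-state, 2-action MDP.
rng = np.random.default_rng(0)
P0 = rng.dirichlet(np.ones(4), size=(4, 2))   # nominal kernel, P0[s, a] sums to 1
R = rng.uniform(size=(4, 2))
V = np.zeros(4)
for _ in range(100):
    V = robust_bellman_backup(P0, R, V, rho=0.1).max(axis=1)
```

Roughly speaking, the $d$-rectangular set in the paper imposes this kind of TV uncertainty per feature dimension of the linear transition kernel rather than per state-action pair, which, per the abstract, is what avoids the extra nonlinearity; the sketch above only conveys the tabular intuition for the TV worst case.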
Related papers
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU are naturally modeled as Multistage Problems (MSPs) but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach Two-Stage General Decision Rules (TS-GDR) to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning [6.969949986864736]
Distributionally robust offline reinforcement learning (RL) seeks robust policy training against environment perturbation by modeling dynamics uncertainty.
We propose minimax optimal and computationally efficient algorithms realizing function approximation.
Our results uncover that function approximation in robust offline RL is essentially distinct from and probably harder than that in standard offline RL.
arXiv Detail & Related papers (2024-03-14T17:55:10Z)
- Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version of Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
arXiv Detail & Related papers (2023-08-10T09:52:44Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation [16.128778192359327]
We learn an RL agent with the historical data obtained from the source environment and optimize it to perform well in the perturbed one.
We prove our algorithm can achieve a suboptimality of $O(1/\sqrt{K})$ depending on the linear function dimension $d$.
arXiv Detail & Related papers (2022-09-14T13:17:59Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Resource Allocation via Model-Free Deep Learning in Free Space Optical Communications [119.81868223344173]
The paper investigates the general problem of resource allocation for mitigating channel fading effects in Free Space Optical (FSO) communications.
Under this framework, we propose two algorithms that solve FSO resource allocation problems.
arXiv Detail & Related papers (2020-07-27T17:38:51Z)
- Robust Inverse Reinforcement Learning under Transition Dynamics Mismatch [60.23815709215807]
We study the inverse reinforcement learning (IRL) problem under a transition dynamics mismatch between the expert and the learner.
We propose a robust MCE IRL algorithm, which is a principled approach to help with this mismatch.
arXiv Detail & Related papers (2020-07-02T14:57:13Z)
- Two-stage Deep Reinforcement Learning for Inverter-based Volt-VAR Control in Active Distribution Networks [3.260913246106564]
We propose a novel two-stage deep reinforcement learning (DRL) method to improve the voltage profile by regulating inverter-based energy resources.
In the offline stage, a highly efficient adversarial reinforcement learning algorithm is developed to train an offline agent robust to the model mismatch.
In the sequential online stage, we transfer the offline agent safely as the online agent to perform continuous learning and controlling online with significantly improved safety and efficiency.
arXiv Detail & Related papers (2020-05-20T08:02:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.