Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings
- URL: http://arxiv.org/abs/2508.03768v1
- Date: Tue, 05 Aug 2025 03:36:50 GMT
- Title: Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings
- Authors: Debamita Ghosh, George K. Atia, Yue Wang
- Abstract summary: Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap. We study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment. We propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions.
- Score: 10.983897709591885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice because of mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments, thereby providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment -- assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on the regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.
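For intuition about the KL-divergence ball mentioned in the abstract, the following is a minimal sketch (our illustration, not the paper's algorithm; the function name and toy numbers are ours) of how a worst-case expected value over a KL ball is computed via its standard one-dimensional convex dual, sup over lam > 0 of -lam * log E_p[exp(-v/lam)] - lam * rho:

```python
# Minimal sketch: worst-case mean of v over a KL ball of radius rho around a
# nominal distribution p, via the standard dual reformulation.
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_value(v, p, rho):
    def neg_dual(lam):
        z = -v / lam
        m = z.max()                          # log-sum-exp stabilization
        log_mgf = m + np.log(np.dot(p, np.exp(z - m)))
        return lam * log_mgf + lam * rho     # negated dual, to be minimized
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

p = np.array([0.25, 0.25, 0.25, 0.25])       # nominal next-state distribution
v = np.array([1.0, 0.5, 0.0, 2.0])           # value at each next state
print(kl_robust_value(v, p, rho=0.1))        # strictly below the nominal mean 0.875
```

A robust Bellman backup replaces the usual expectation over next states with this worst-case value; the chi-square ball admits an analogous one-dimensional dual.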
Related papers
- Online Robust Multi-Agent Reinforcement Learning under Model Uncertainties [10.054572105379425]
Well-trained multi-agent systems can fail when deployed in real-world environments. Distributionally robust Markov games (DRMGs) enhance system resilience by optimizing for worst-case performance over a defined set of environmental uncertainties. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data.
arXiv Detail & Related papers (2025-08-04T23:14:32Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models (BFMs). Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process. We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
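As a rough, hypothetical illustration of such a search (the interface below is a stand-in we invented, not the paper's API), adaptation can be as simple as black-box search over the embedding:

```python
# Hypothetical sketch: adapt by searching a low-dimensional task-embedding
# space for the embedding whose conditioned policy earns the highest return.
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(z):
    # Stand-in for "roll out the z-conditioned policy in the environment";
    # a toy function whose optimum is unknown to the searcher.
    target = np.array([0.3, -0.7, 0.5])
    return -np.sum((z - target) ** 2)

best_z, best_ret = None, -np.inf
for _ in range(200):                     # simple random search, no gradients
    z = rng.normal(size=3)               # candidate task embedding
    ret = rollout_return(z)
    if ret > best_ret:
        best_z, best_ret = z, ret
print(best_z, best_ret)
```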
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - Preference Elicitation for Offline Reinforcement Learning [59.136381500967744]
We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm. Our algorithm employs a pessimistic approach for out-of-distribution data and an optimistic approach for acquiring informative preferences about the optimal policy.
arXiv Detail & Related papers (2024-06-26T15:59:13Z) - Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm [14.517103323409307]
The sim-to-real gap is the disparity between training and testing environments.
A promising approach to addressing this challenge is distributionally robust RL.
We tackle robust RL via interactive data collection and present an algorithm with a provable sample complexity guarantee.
arXiv Detail & Related papers (2024-04-04T16:40:22Z) - Dynamic Environment Responsive Online Meta-Learning with Fairness Awareness [30.44174123736964]
We introduce an innovative adaptive fairness-aware online meta-learning algorithm, referred to as FairSAOML.
Our experimental evaluation on various real-world datasets in dynamic environments demonstrates that our proposed FairSAOML algorithm consistently outperforms alternative approaches.
arXiv Detail & Related papers (2024-02-19T17:44:35Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision-making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Single-Trajectory Distributionally Robust Reinforcement Learning [21.955807398493334]
We propose Distributionally Robust RL (DRRL) to enhance performance across a range of environments.
Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory.
We design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ).
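DRQ itself is model-free and learns from a single trajectory; as a simpler point of reference (our sketch, not the paper's method), the robust Bellman target it approximates can be written down directly when the nominal model is known:

```python
# Our sketch: tabular robust value iteration on a toy 2-state MDP, using the
# KL dual evaluated on a coarse grid over the dual variable lam.
import numpy as np

nS, nA, gamma, rho = 2, 2, 0.9, 0.1
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])    # nominal transitions P[s, a, s']
R = np.array([[1.0, 0.0], [0.5, 2.0]])      # rewards R[s, a]

def kl_worst_case(v, p, rho, lams=np.logspace(-3, 3, 200)):
    best = -np.inf
    for lam in lams:
        z = -v / lam
        m = z.max()                          # log-sum-exp stabilization
        dual = -lam * (m + np.log(np.dot(p, np.exp(z - m)))) - lam * rho
        best = max(best, dual)
    return best

Q = np.zeros((nS, nA))
for _ in range(100):                         # robust value iteration
    V = Q.max(axis=1)
    Q = R + gamma * np.array([[kl_worst_case(V, P[s, a], rho)
                               for a in range(nA)] for s in range(nS)])
print(Q)
```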
arXiv Detail & Related papers (2023-01-27T14:08:09Z) - Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification [22.241676350331968]
This study focuses on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values.
The aim is to optimize the worst-case performance on the uncertainty parameter set to guarantee the performance in the corresponding real-world environment.
Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method achieves worst-case performance superior to several baseline approaches.
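A toy rendering of the max-min idea (ours, not the paper's actor-critic): repeatedly identify the worst-performing uncertainty parameter and improve the policy against it.

```python
# Subgradient ascent on the worst case over a finite uncertainty set.
import numpy as np

params = [0.5, 1.0, 2.0]                 # candidate simulator parameters
theta = np.zeros(2)                      # toy policy parameters

def perf(theta, k):
    # stand-in for "expected return of policy theta under simulator k"
    return -k * np.sum((theta - np.array([1.0, -1.0]) / k) ** 2)

def grad_perf(theta, k):
    return -2 * k * (theta - np.array([1.0, -1.0]) / k)

for _ in range(500):
    worst = min(params, key=lambda k: perf(theta, k))   # adversary's pick
    theta += 0.05 * grad_perf(theta, worst)             # improve worst case
print(theta, min(perf(theta, k) for k in params))
```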
arXiv Detail & Related papers (2022-11-07T10:18:31Z) - Grounding Aleatoric Uncertainty in Unsupervised Environment Design [32.00797965770773]
In partially observable settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment.
We propose a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to curriculum-induced covariate shift (CICS).
arXiv Detail & Related papers (2022-07-11T22:45:29Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
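The pessimism principle behind such algorithms can be sketched in a few lines (our illustration; it keeps a count-based penalty but omits the paper's variance reduction):

```python
# Penalize each Q-update by a confidence bonus so that state-action pairs
# with little offline data are valued conservatively.
import numpy as np

nS, nA, gamma, c = 4, 2, 0.9, 1.0
Q = np.zeros((nS, nA))
N = np.zeros((nS, nA))                   # visit counts in the offline data

def pessimistic_update(s, a, r, s_next):
    N[s, a] += 1
    bonus = c / np.sqrt(N[s, a])         # shrinks as (s, a) is seen more often
    target = r + gamma * Q[s_next].max() - bonus
    target = max(target, 0.0)            # rewards in [0, 1], so 0 lower-bounds Q
    Q[s, a] += (target - Q[s, a]) / N[s, a]

rng = np.random.default_rng(1)
for _ in range(1000):                    # replay a fixed batch of transitions
    s, a = rng.integers(nS), rng.integers(nA)
    pessimistic_update(s, a, rng.random(), rng.integers(nS))
print(Q)
```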
arXiv Detail & Related papers (2022-02-28T15:39:36Z) - Pessimistic Model Selection for Offline Deep Reinforcement Learning [56.282483586473816]
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications.
One main barrier is the overfitting issue that leads to poor generalizability of the policy learned by DRL.
We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee.
arXiv Detail & Related papers (2021-11-29T06:29:49Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves state-of-the-art (SoTA) performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z) - MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization (MUSBO).
Our framework discovers novel and high-quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)