Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
- URL: http://arxiv.org/abs/2512.02486v1
- Date: Tue, 02 Dec 2025 07:20:39 GMT
- Title: Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
- Authors: Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, Shuang Qiu
- Abstract summary: Single-domain offline reinforcement learning (RL) often suffers from limited data coverage. Cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL.
- Score: 68.18666621908898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical Dual-RObust Cross-domain Offline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
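The abstract names three ingredients (a pessimistic robust backup, a dynamic value penalty, and a Huber loss) without pseudocode. The toy sketch below illustrates how such pieces can combine for a single scalar value estimate, assuming a finite set of next-state values under perturbed dynamics; all names (`robust_backup`, `penalty`, `huber`) are illustrative stand-ins, not the paper's actual RCB operator.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails, so a few
    badly mis-estimated targets cannot dominate the value fit."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def robust_backup(next_values, reward, gamma=0.99, penalty=0.1):
    """Pessimistic backup over next-state values under several plausible
    (perturbed) dynamics: the min guards against test-time perturbations,
    and `penalty` discounts out-of-distribution transitions."""
    return reward + gamma * (np.min(next_values) - penalty)

rng = np.random.default_rng(0)
q = 0.0  # scalar value estimate for one state-action pair
for _ in range(500):
    next_values = 1.0 + 0.2 * rng.standard_normal(5)
    target = robust_backup(next_values, reward=0.5)
    residual = q - target
    q -= 0.1 * np.clip(residual, -1.0, 1.0)  # gradient step on huber(residual)
print(f"fitted Q ~= {q:.3f}, last Huber loss {float(huber(residual)):.4f}")
```

Taking the minimum over perturbed next-state values is the standard way to make a backup robust to bounded dynamics perturbations; the clipped gradient is exactly the derivative of the Huber loss with delta = 1.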
Related papers
- Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
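As a rough illustration of the filtering step described above (hypothetical names, not the Hi-ORS implementation), rejection sampling on reward can be as simple as:

```python
# Keep only transitions whose reward is non-negative, so the value and
# policy updates never see negatively rewarded samples.
def filter_rollouts(rollouts):
    return [(obs, act, rew) for (obs, act, rew) in rollouts if rew >= 0.0]

rollouts = [("s0", "a0", 1.0), ("s1", "a1", -0.5), ("s2", "a2", 0.0)]
print(filter_rollouts(rollouts))  # the -0.5 sample is rejected
```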
arXiv Detail & Related papers (2025-10-30T11:53:08Z)
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting [91.38734024438357]
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting.
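A hedged sketch of what dynamically weighting an SFT loss against an RL loss can look like; the linear schedule and the names are illustrative, not CHORD's actual mechanism.

```python
def combined_loss(sft_loss, rl_loss, step, total_steps):
    # Anneal the supervised weight from 1 to 0, shifting training
    # smoothly from imitating off-policy experts to on-policy RL.
    mu = max(0.0, 1.0 - step / total_steps)
    return mu * sft_loss + (1.0 - mu) * rl_loss

for step in (0, 500, 1000):
    print(step, combined_loss(sft_loss=2.0, rl_loss=0.5, step=step, total_steps=1000))
```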
arXiv Detail & Related papers (2025-08-15T11:20:03Z)
- Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization [11.320660946946523]
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. We show that performing on-policy reinforcement learning with an evidential critic provides both of these properties. We name the resulting algorithm Evidential Proximal Policy Optimization (EPPO) due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages.
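For intuition, an evidential critic can be built in the style of deep evidential regression: the value head outputs Normal-Inverse-Gamma parameters, from which both a value estimate and its epistemic uncertainty follow. This is an assumed construction for illustration, not the EPPO architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialValueHead(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, 4)  # (gamma, nu, alpha, beta)

    def forward(self, h):
        gamma, nu, alpha, beta = self.fc(h).unbind(-1)
        nu = F.softplus(nu)             # > 0
        alpha = F.softplus(alpha) + 1   # > 1 so the variance is finite
        beta = F.softplus(beta)         # > 0
        value = gamma                           # predicted value
        epistemic = beta / (nu * (alpha - 1))   # epistemic uncertainty
        return value, epistemic

head = EvidentialValueHead(8)
v, u = head(torch.randn(3, 8))
print(v.shape, u.shape)
```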
arXiv Detail & Related papers (2025-03-03T12:23:07Z)
- Dynamic Reinforcement Learning for Actors [0.0]
Dynamic Reinforcement Learning (Dynamic RL) directly controls system dynamics, instead of the actor (action-generating neural network) outputs at each moment. The actor is initially designed to generate chaotic dynamics through the loop with its environment. Dynamic RL controls global system dynamics using a local index called "sensitivity".
arXiv Detail & Related papers (2025-02-14T14:50:05Z)
- Improve Robustness of Reinforcement Learning against Observation Perturbations via $l_\infty$ Lipschitz Policy Networks [8.39061976254379]
Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks.
Recent works have revealed that DRL agents are susceptible to slight perturbations in observations.
We propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations.
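One standard way to obtain an $l_\infty$ Lipschitz bound, shown below as a hedged sketch (not the SortRL architecture): a linear layer is 1-Lipschitz w.r.t. the $l_\infty$ norm when every weight row has $l_1$ norm at most 1, and a sorting activation preserves that bound because it only permutes coordinates.

```python
import torch
import torch.nn as nn

class LInfLipschitzLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        # Induced inf-norm of W is the max row l1 norm; cap it at 1.
        row_l1 = self.weight.abs().sum(dim=1, keepdim=True).clamp(min=1.0)
        w = self.weight / row_l1
        return x @ w.t() + self.bias

def sort_activation(x):
    # Sorting permutes coordinates, hence is 1-Lipschitz in l_inf.
    return torch.sort(x, dim=-1).values

layer = LInfLipschitzLinear(4, 4)
print(sort_activation(layer(torch.randn(2, 4))).shape)
```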
arXiv Detail & Related papers (2023-12-14T08:57:22Z)
- Two-step dynamic obstacle avoidance [0.0]
This paper proposes a two-step architecture for handling dynamic obstacle avoidance (DOA) tasks by combining supervised and reinforcement learning (RL).
In the first step, we introduce a data-driven approach to estimate the collision risk (CR) of an obstacle using a recurrent neural network.
In the second step, we include these CR estimates into the observation space of an RL agent to increase its situational awareness.
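An illustrative sketch of the two-step idea (hypothetical names; the paper's estimator is a recurrent network, stubbed out here): a collision-risk estimate is computed from the obstacle track and appended to the agent's observation.

```python
import numpy as np

def estimate_collision_risk(track):
    """Stand-in for the recurrent CR estimator: here, risk simply
    grows as the obstacle's most recent distance shrinks."""
    latest_distance = track[-1]
    return float(np.clip(1.0 - latest_distance / 10.0, 0.0, 1.0))

def augment_observation(obs, track):
    return np.concatenate([obs, [estimate_collision_risk(track)]])

obs = np.array([0.3, -0.1])
track = [9.0, 6.5, 4.0]  # obstacle distances over time
print(augment_observation(obs, track))  # obs plus a CR value in [0, 1]
```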
arXiv Detail & Related papers (2023-11-28T14:55:50Z)
- Investigating the Edge of Stability Phenomenon in Reinforcement Learning [20.631461205889487]
We explore the edge of stability phenomenon in reinforcement learning (RL).
Despite significant differences to supervised learning, the edge of stability phenomenon can be present in off-policy deep RL.
Our results suggest that, while neural network structure can lead to optimisation dynamics that transfer between problem domains, certain aspects of deep RL optimisation can differentiate it from domains such as supervised learning.
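For context, the edge of stability is usually diagnosed by tracking the sharpness (largest Hessian eigenvalue of the loss) against 2 / learning_rate. The sketch below estimates sharpness by power iteration on a toy quadratic; it is a generic diagnostic, not code from the paper.

```python
import torch

def sharpness(loss_fn, params, iters=20):
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat)
    for _ in range(iters):
        v = v / v.norm()
        hv = torch.autograd.grad(flat @ v, params, retain_graph=True)
        v = torch.cat([h.reshape(-1) for h in hv])  # Hessian-vector product
    return v.norm().item()  # approx. top Hessian eigenvalue

w = torch.randn(3, requires_grad=True)
loss_fn = lambda: (w**2).sum() * 1.5  # Hessian = 3 * I
lr = 0.1
print(f"sharpness approx {sharpness(loss_fn, [w]):.2f}, edge of stability at 2/lr = {2/lr:.1f}")
```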
arXiv Detail & Related papers (2023-07-09T15:46:27Z)
- Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
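For reference, the temporal difference update the summary refers to is the textbook TD(0) rule, V(s) <- V(s) + lr * (r + gamma * V(s') - V(s)); the value of a rewarding state bootstraps backwards through its predecessors. (Generic textbook sketch, not code from the paper.)

```python
import numpy as np

V = np.zeros(3)
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 2)]  # (s, r, s')
for _ in range(1000):
    for s, r, s_next in transitions:
        td_error = r + 0.9 * V[s_next] - V[s]
        V[s] += 0.1 * td_error
print(V.round(2))  # values propagate back from the rewarding state
```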
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
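A hedged sketch of the randomized-smoothing construction behind such certificates (illustrative names; the certification math itself is omitted): act with the majority vote of the policy over Gaussian-noised observations.

```python
import numpy as np

def smoothed_action(policy, obs, sigma=0.2, n_samples=100, rng=None):
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        noisy = obs + sigma * rng.standard_normal(obs.shape)
        a = policy(noisy)
        votes[a] = votes.get(a, 0) + 1
    return max(votes, key=votes.get)  # majority action

policy = lambda o: int(o.sum() > 0)  # toy two-action policy
print(smoothed_action(policy, np.array([0.3, -0.1])))
```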
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
- Causal Inference Q-Network: Toward Resilient Reinforcement Learning [57.96312207429202]
We consider a resilient DRL framework with observational interferences.
Under this framework, we propose a causal-inference-based DRL algorithm called causal inference Q-network (CIQ).
Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.
arXiv Detail & Related papers (2021-02-18T23:50:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.