Related papers: DROMO: Distributionally Robust Offline Model-based Policy Optimization

URL: http://arxiv.org/abs/2109.07275v1
Date: Wed, 15 Sep 2021 13:25:14 GMT
Title: DROMO: Distributionally Robust Offline Model-based Policy Optimization
Authors: Ruizhen Liu, Dazhi Zhong, Zhicong Chen
Abstract summary: We consider the problem of offline reinforcement learning with model-based control. We propose distributionally robust offline model-based policy optimization (DROMO).
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We consider the problem of offline reinforcement learning with model-based control, whose goal is to learn a dynamics model from the experience replay and obtain a pessimism-oriented agent under the learned model. Current model-based constraints include an explicit uncertainty penalty and an implicit conservative regularization that pushes the Q-values of out-of-distribution state-action pairs down and those of in-distribution pairs up. While the uncertainty estimation, on which the former relies, can be loosely calibrated for complex dynamics, the latter performs slightly better. To extend the basic idea of regularization without uncertainty quantification, we propose distributionally robust offline model-based policy optimization (DROMO), which leverages the ideas in distributionally robust optimization to penalize a broader range of out-of-distribution state-action pairs beyond the standard empirical out-of-distribution Q-value minimization. We theoretically show that our method optimizes a lower bound on the ground-truth policy evaluation, and that it can be incorporated into any existing policy gradient algorithm. We also analyze the theoretical properties of DROMO's linear and non-linear instantiations.
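Illustration: the abstract describes a conservative regularization that pushes Q-values of out-of-distribution state-action pairs down and in-distribution ones up. A minimal sketch of that standard empirical regularizer is given below. It is not DROMO's distributionally robust objective, which the abstract does not spell out. The function names, the `beta` weight, and the use of plain Python lists of Q-values are assumptions made for this example only.

```python
from statistics import fmean


def conservative_penalty(q_ood, q_data, beta=1.0):
    """Empirical conservative regularizer (hypothetical helper).

    q_ood:  Q-values at out-of-distribution state-action pairs,
            e.g. pairs sampled from rollouts of the learned model.
    q_data: Q-values at state-action pairs from the offline dataset.
    beta:   assumed regularization weight.

    Minimizing the returned value lowers the mean Q-value on the
    out-of-distribution pairs and raises it on the dataset pairs.
    """
    return beta * (fmean(q_ood) - fmean(q_data))


def regularized_critic_loss(bellman_error, q_ood, q_data, beta=1.0):
    """Bellman error plus the conservative penalty (hypothetical helper)."""
    return bellman_error + conservative_penalty(q_ood, q_data, beta)
```

For example, calling `conservative_penalty([3.0, 5.0], [1.0, 2.0], beta=0.5)` compares a mean of 4.0 on the out-of-distribution pairs with a mean of 1.5 on the dataset pairs, so the penalty is positive and a critic trained on this loss would be pushed toward lower values on the former.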

Related papers

Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
Balance Equation-based Distributionally Robust Offline Imitation Learning [8.607736795429638]
Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. Standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. We address this challenge through Balance Equation-based Distributionally Robust Offline Learning. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution.
arXiv Detail & Related papers (2025-11-11T07:48:09Z)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
Dual Alignment Maximin Optimization for Offline Model-based RL [10.048622079413313]
Offline reinforcement agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. In this paper, we first shift the focus from model reliability to policy discrepancies while optimizing for expected returns, and then self-consistently incorporate synthetic data. It is a unified framework to ensure both model-environment policy consistency and synthetic and data offline.
arXiv Detail & Related papers (2025-02-02T16:47:35Z)
SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning [9.88109749688605]
Model-based Offline Reinforcement Learning trains policies based on offline datasets and model dynamics. This paper disentangles the problem into two key components: model bias and policy shift. We introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL).
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training. For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
Model-based Offline Policy Optimization with Adversarial Network [0.36868085124383626]
We propose a novel Model-based Offline policy optimization framework with Adversarial Network (MOAN). The key idea is to use adversarial learning to build a transition model with better generalization. Our approach outperforms existing state-of-the-art baselines on widely studied offline RL benchmarks.
arXiv Detail & Related papers (2023-09-05T11:49:33Z)
When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
Model-based Offline Reinforcement Learning with Local Misspecification [35.75701143290119]
We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch. We propose an empirical algorithm for optimal offline policy selection.
arXiv Detail & Related papers (2023-01-26T21:26:56Z)
When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL). Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief [3.0036519884678894]
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. In this work, we maintain a belief distribution over dynamics, and evaluate/optimize the policy through biased sampling from the belief. We show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
arXiv Detail & Related papers (2022-10-13T03:14:36Z)
Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes. A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions. We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.