Balance Equation-based Distributionally Robust Offline Imitation Learning
- URL: http://arxiv.org/abs/2511.07942v1
- Date: Wed, 12 Nov 2025 01:29:42 GMT
- Title: Balance Equation-based Distributionally Robust Offline Imitation Learning
- Authors: Rishabh Agrawal, Yusuf Alvi, Rahul Jain, Ashutosh Nayyar
- Abstract summary: Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. Standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution.
- Score: 8.607736795429638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. However, standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. In practice, this assumption rarely holds: modeling inaccuracies, real-world parameter variations, and adversarial perturbations can all induce shifts in transition dynamics, leading to severe performance degradation. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that learns robust policies solely from expert demonstrations collected under nominal dynamics, without requiring further environment interaction. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution. Importantly, we show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. Empirical evaluations on continuous-control benchmarks demonstrate that our approach achieves superior robustness and generalization compared to state-of-the-art offline IL baselines, particularly under perturbed or shifted environments.
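One plausible formalization of the robust objective sketched in the abstract (the notation here is assumed, not taken from the paper): the policy is optimized against the worst transition model inside a divergence ball around the nominal dynamics.

```latex
% Hedged sketch of the min-max objective, assuming a divergence ball
% \mathcal{P}(P_0) of radius \rho around the nominal dynamics P_0,
% d^{\pi}_{P} the occupancy measure of \pi under dynamics P, and an
% imitation loss \ell against the expert policy \pi_E.
\min_{\pi} \; \max_{P \in \mathcal{P}(P_0)} \;
  \mathbb{E}_{(s,a) \sim d^{\pi}_{P}}
  \big[ \ell\big( \pi(\cdot \mid s), \pi_E(\cdot \mid s) \big) \big],
\qquad
\mathcal{P}(P_0) = \big\{ P :
  D\big( P(\cdot \mid s,a) \,\big\|\, P_0(\cdot \mid s,a) \big) \le \rho
  \;\; \forall (s,a) \big\}.
```

The abstract's key claim is that the inner maximization can be rewritten against the nominal data distribution alone, which is what makes purely offline training tractable.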
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates.
SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence.
Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
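A minimal sketch of what SNR-driven selection of policy updates could look like, assuming per-sample gradients are available; the function names and selection rule are illustrative, not SAGE's actual procedure.

```python
import numpy as np

def update_snr(per_sample_grads):
    """Signal-to-noise ratio of a policy update: norm of the mean
    gradient relative to the average deviation around that mean."""
    g = np.stack(per_sample_grads)            # (n_samples, n_params)
    mean = g.mean(axis=0)
    noise = np.linalg.norm(g - mean, axis=1).mean()
    return np.linalg.norm(mean) / (noise + 1e-8)

def refresh_pools(candidate_pools, grad_fn, k=1):
    """Keep the k candidate pools whose update SNR is highest -- a
    hypothetical stand-in for the curriculum refresh step above."""
    ranked = sorted(candidate_pools,
                    key=lambda pool: update_snr([grad_fn(x) for x in pool]),
                    reverse=True)
    return ranked[:k]
```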
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making [48.998030470623384]
Offline decision-making requires reliable behaviors from fixed datasets without further interaction.
We propose a compositional model-based diffusion framework consisting of: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker module that selects behaviors aligned with the task objectives.
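A hedged sketch of the compose-then-select loop the three components imply; `planner.sample`, `dynamics.is_consistent`, and `ranker.score` are hypothetical interfaces, not the paper's API.

```python
def plan_with_composition(planner, dynamics, ranker, state, n_candidates=64):
    """Sample diverse candidate trajectories, keep those consistent with
    the learned dynamics, then return the one the ranker scores highest."""
    candidates = [planner.sample(state) for _ in range(n_candidates)]
    feasible = [t for t in candidates if dynamics.is_consistent(t)]
    return max(feasible, key=ranker.score) if feasible else None
```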
arXiv Detail & Related papers (2025-12-09T06:26:02Z) - Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning [56.47948583452555]
We introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme aligns it with the variational Jordan-Kinderlehrer-Otto principle from optimal transport.
SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions.
This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages.
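For concreteness, the fixed-step Euler discretization of a flow-matching ODE looks as follows; the velocity field and step count are assumptions, and this is the generic integration scheme rather than SWFP itself.

```python
import numpy as np

def euler_flow_sample(velocity_field, x0, n_steps=10):
    """Fixed-step Euler integration of dx/dt = v(x, t) from t=0 to t=1:
    each step x <- x + dt * v(x, t) is a small incremental transformation
    between proximate distributions, the structure SWFP exploits."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_field(x, k * dt)
    return x

# Toy usage: a velocity field that contracts samples toward the origin.
sample = euler_flow_sample(lambda x, t: -x, x0=[2.0, -1.0])
```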
arXiv Detail & Related papers (2025-10-17T07:43:51Z) - Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
Context drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns.
Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics.
We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z) - Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery [3.549243565065057]
Imitation learning is a data-driven approach to learning policies from expert behavior.
It is prone to unreliable outcomes in out-of-sample (OOS) regions.
We propose a framework for learning policies modeled by contractive dynamical systems.
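Contractivity is what drives the OOS recovery: if the policy is realized as a dynamical system whose Jacobian has a uniformly negative-definite symmetric part, any perturbed rollout converges back to the demonstrated behavior exponentially fast. A standard sufficient condition (in the Euclidean metric; the paper may use a more general one):

```latex
% Contraction condition for the dynamical policy \dot{x} = f(x),
% with contraction rate \beta > 0:
\frac{\partial f}{\partial x}(x) + \frac{\partial f}{\partial x}(x)^{\top}
  \preceq -2\beta I \quad \forall x
\;\;\Longrightarrow\;\;
\| x_1(t) - x_2(t) \| \le e^{-\beta t}\, \| x_1(0) - x_2(0) \|.
```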
arXiv Detail & Related papers (2024-12-10T14:28:18Z) - Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning [36.01269673940484]
This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift.
Our theoretical and empirical investigations reveal how these factors distort value estimation and policy optimization.
We derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training.
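A minimal sketch of what a shifts-aware reward correction could look like, with one common proxy for each error source; the penalty terms, names, and weights are illustrative assumptions, and the paper's actual construction comes from a probabilistic inference framework.

```python
import numpy as np

def shifts_aware_reward(r, next_state_preds, behavior_logp, policy_logp,
                        lam=1.0, mu=1.0):
    """Hypothetical reward correction penalizing the two error sources:
    - model bias: disagreement across an ensemble of dynamics models;
    - policy shift: per-sample log-probability gap between the learned
      policy and the behavior policy.
    lam and mu are illustrative trade-off weights, not from the paper."""
    model_bias = np.std(np.stack(next_state_preds), axis=0).mean()
    policy_shift = max(policy_logp - behavior_logp, 0.0)
    return r - lam * model_bias - mu * policy_shift
```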
arXiv Detail & Related papers (2024-08-23T04:25:09Z) - Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning [8.92571113137362]
We address a scenario where the imitator relies solely on observed behavior and cannot interact with the environment during learning.
Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting.
We demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.
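For reference, the Markov balance (flow-conservation) equation that a policy's discounted occupancy measure must satisfy; this is the standard discounted form, and the paper's exact variant may differ.

```latex
% Balance of occupancy mass: inflow from the initial distribution \mu_0
% plus discounted inflow along the dynamics P under policy \pi.
d^{\pi}(s') = (1-\gamma)\,\mu_0(s')
  + \gamma \sum_{s,a} d^{\pi}(s)\,\pi(a \mid s)\,P(s' \mid s,a).
```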
arXiv Detail & Related papers (2024-08-17T07:17:19Z) - Online Nonstochastic Model-Free Reinforcement Learning [35.377261344335736]
We investigate robustness guarantees for environments that may be dynamic or adversarial.
We provide efficient algorithms for optimizing these policies.
These are the best-known results with no dependence on the dimension of the state space.
arXiv Detail & Related papers (2023-05-27T19:02:55Z) - DROMO: Distributionally Robust Offline Model-based Policy Optimization [0.0]
We consider the problem of offline reinforcement learning with model-based control.
We propose distributionally robust offline model-based policy optimization (DROMO).
arXiv Detail & Related papers (2021-09-15T13:25:14Z) - Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
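The quantity at stake is empowerment, the channel capacity from actions to resulting states; the definition below is the standard one, and the Gaussian-channel reduction is the idea the summary alludes to.

```latex
% Empowerment at state s: maximal mutual information between an action
% (sequence) A and the successor state S', over action distributions \omega.
\mathcal{E}(s) = \max_{\omega(a \mid s)} I(A; S' \mid S = s).
% For a learned linear-Gaussian channel S' = W_s A + \varepsilon, with
% \varepsilon \sim \mathcal{N}(0, \Sigma) and a power constraint on A,
% this capacity has a closed form obtained by water-filling over the
% singular values of the (noise-whitened) channel matrix W_s.
```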
arXiv Detail & Related papers (2020-07-14T21:10:16Z) - Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
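One energy-based construction consistent with these three desiderata (a sketch of the general idea; the paper's exact parameterization may differ): share a single energy function between the policy and a model of the state distribution, so that fitting the latter to the demonstrations shapes the former, entirely offline.

```latex
% Shared energy f_\theta: the policy is its softmax over actions, and a
% state model is induced from the same parameters via log-sum-exp,
\pi_{\theta}(a \mid s)
  = \frac{\exp f_{\theta}(s,a)}{\sum_{a'} \exp f_{\theta}(s,a')},
\qquad
\rho_{\theta}(s) \propto
  \exp\Big( \log \textstyle\sum_{a} \exp f_{\theta}(s,a) \Big),
% with \rho_{\theta} matched to the empirical state distribution of the
% demonstrations -- no rollouts or reward signals required.
```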
arXiv Detail & Related papers (2020-06-25T03:27:59Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation of stationary values can still be achieved offline in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
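A hedged sketch of how such a correction ratio is used downstream once learned: reweight off-policy samples so empirical averages behave as if drawn from the target stationary distribution. The ratio function here is assumed to be already trained; this is the generic reweighting step, not GenDICE's estimation procedure itself.

```python
import numpy as np

def corrected_average(values, states, actions, ratio_fn):
    """Reweight off-policy samples by a learned correction ratio
    ratio_fn(s, a) ~ d_target(s, a) / d_data(s, a) (assumed given),
    then average; self-normalization keeps the estimate well-scaled."""
    w = np.array([ratio_fn(s, a) for s, a in zip(states, actions)])
    w = w / w.mean()                      # self-normalize the ratios
    return float(np.mean(w * np.asarray(values)))
```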
arXiv Detail & Related papers (2020-02-21T00:27:52Z)