Related papers: Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions

URL: http://arxiv.org/abs/2601.18107v1
Date: Mon, 26 Jan 2026 03:38:27 GMT
Title: Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions
Authors: Pedram Agand, Mo Chen
Abstract summary: We present MoReBRAC, a model-based framework that addresses the distributional shift between the static dataset and the learned policy. We implement a hierarchical uncertainty pipeline integrating Variational Autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. Our results on D4RL Gym-MuJoCo benchmarks reveal significant performance gains, particularly in "random" and "suboptimal" data regimes.
Score: 4.359780028396042
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Offline Reinforcement Learning (ORL) holds immense promise for safety-critical domains like industrial robotics, where real-time environmental interaction is often prohibitive. A primary obstacle in ORL remains the distributional shift between the static dataset and the learned policy, which typically mandates high degrees of conservatism that can restrain potential policy improvements. We present MoReBRAC, a model-based framework that addresses this limitation through Uncertainty-Aware latent synthesis. Instead of relying solely on the fixed data, MoReBRAC utilizes a dual-recurrent world model to synthesize high-fidelity transitions that augment the training manifold. To ensure the reliability of this synthetic data, we implement a hierarchical uncertainty pipeline integrating Variational Autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. This multi-layered filtering process guarantees that only transitions residing within high-confidence regions of the learned dynamics are utilized. Our results on D4RL Gym-MuJoCo benchmarks reveal significant performance gains, particularly in "random" and "suboptimal" data regimes. We further provide insights into the role of the VAE as a geometric anchor and discuss the distributional trade-offs encountered when learning from near-optimal datasets.
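The MC dropout stage of the filtering pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy linear dynamics model, the function names `mc_dropout_std` and `filter_transitions`, the dropout rate, and the threshold are all assumptions chosen for illustration, and the VAE and sensitivity stages are omitted.

```python
import numpy as np

def mc_dropout_std(states_actions, weights, n_passes=32, drop_p=0.1, seed=0):
    """Per-sample predictive std of a toy linear dynamics model under MC dropout.

    Hypothetical stand-in for a learned world model: next_state = input @ weights.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_passes):
        # Sample a fresh dropout mask over the input features on every pass.
        mask = rng.random(states_actions.shape) >= drop_p
        dropped = states_actions * mask / (1.0 - drop_p)
        preds.append(dropped @ weights)
    preds = np.stack(preds)                 # (n_passes, n_samples, state_dim)
    # Std across passes, averaged over the predicted state dimensions.
    return preds.std(axis=0).mean(axis=1)   # (n_samples,)

def filter_transitions(states_actions, weights, threshold):
    """Keep only inputs whose MC-dropout std falls below the (assumed) threshold."""
    std = mc_dropout_std(states_actions, weights)
    keep = std < threshold
    return states_actions[keep], keep
```

In this toy setting, inputs whose predictions vary strongly across stochastic passes are rejected as low-confidence, while stable ones are kept for augmenting the training data. A real pipeline would apply dropout inside a neural world model and would calibrate the threshold on held-out data.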

Related papers

IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning [13.655904209137006]
We propose Imaginary Planning Distillation (IPD), a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data. By replacing the conventional, manually-tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference.
arXiv Detail & Related papers (2026-03-04T17:05:39Z)
Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning [22.038062200642162]
Offline multi-agent reinforcement learning (MARL) aims to solve cooperative decision-making problems in multi-agent systems using pre-collected datasets. We introduce an uncertainty-aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies.
arXiv Detail & Related papers (2026-01-12T12:17:11Z)
Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM achieves a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z)
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models [0.0]
We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs). DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to rank reduction techniques in resource-constrained deep learning.
arXiv Detail & Related papers (2025-12-17T21:09:19Z)
Balance Equation-based Distributionally Robust Offline Imitation Learning [8.607736795429638]
Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. Standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. We address this challenge through Balance Equation-based Distributionally Robust Offline Learning. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution.
arXiv Detail & Related papers (2025-11-11T07:48:09Z)
EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning [63.03672166010434]
We introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework. It jointly synthesizes problems, diverse candidate solutions, and verification artifacts. It iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks.
arXiv Detail & Related papers (2025-10-20T11:56:35Z)
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework. It reframes the learning task to predict the residual deviation from an inertial reference. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z)
Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator [50.191655141020505]
Reinforcement Learning (RL) has demonstrated impressive capabilities in robotic control but remains challenging due to high sample complexity, safety concerns, and the sim-to-real gap. We introduce Offline Robotic World Model (RWM-O), a model-based approach that explicitly estimates uncertainty to improve policy learning without reliance on a physics simulator.
arXiv Detail & Related papers (2025-04-23T12:58:15Z)
Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
Robust Reinforcement Learning using Offline Data [23.260211453437055]
We propose a robust reinforcement learning algorithm called Robust Fitted Q-Iteration (RFQI). RFQI uses only an offline dataset to learn the optimal robust policy. We prove that RFQI learns a near-optimal robust policy under standard assumptions.
arXiv Detail & Related papers (2022-08-10T03:47:45Z)
PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators [19.026312915461553]
We propose a model-based offline reinforcement learning (RL) approach called PerSim. We first learn a personalized simulator for each agent by collectively using the historical trajectories across all agents prior to learning a policy. This representation suggests a simple, regularized neural network architecture to effectively learn the transition dynamics per agent, even with scarce, offline data.
arXiv Detail & Related papers (2021-02-13T17:16:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.