Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
- URL: http://arxiv.org/abs/2511.21584v1
- Date: Wed, 26 Nov 2025 17:01:41 GMT
- Title: Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
- Authors: Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao
- Abstract summary: We propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, then trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes.
- Score: 54.46325690390831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end (E2E) autonomous driving models have demonstrated strong performance in open-loop evaluations but often suffer from cascading errors and poor generalization in closed-loop settings. To address this gap, we propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, exposing the agent to scenarios beyond the original dataset. Based on this generated data, MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the one with the highest expected utility. Experiments on the nuScenes benchmark using a photorealistic closed-loop simulator demonstrate that MPA significantly improves performance across in-domain, out-of-domain, and safety-critical scenarios. We further investigate how the scale of counterfactual data and inference-time guidance strategies affect overall effectiveness.
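The inference-time procedure described in the abstract (the adapter proposes candidates, the Q value model selects among them) can be read as a short propose-then-select loop. The sketch below is a minimal illustration of that loop; the interfaces (`base_policy`, `adapter`, `q_model`) are hypothetical placeholders, not the paper's actual API.

```python
# Minimal sketch of MPA's inference-time loop: propose candidates with the
# diffusion adapter, score them with the multi-step Q model, execute the best.
# All interfaces here are hypothetical placeholders, not the paper's API.
import numpy as np

def mpa_act(observation, base_policy, adapter, q_model, num_candidates=16):
    # 1. The pretrained E2E policy proposes an initial trajectory from raw sensors.
    base_traj = base_policy(observation)
    # 2. The diffusion adapter refines it into diverse candidate trajectories
    #    (one reverse-diffusion sample per candidate).
    candidates = [adapter(observation, base_traj) for _ in range(num_candidates)]
    # 3. The multi-step Q model estimates each candidate's long-term utility.
    scores = np.array([q_model(observation, traj) for traj in candidates])
    # 4. Select the candidate with the highest expected utility.
    return candidates[int(np.argmax(scores))]
```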
Related papers
- IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning [13.655904209137006]
We propose Imaginary Planning Distillation (IPD), a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference. The framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data. By replacing the conventional, manually tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference.
arXiv Detail & Related papers (2026-03-04T17:05:39Z)
- Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models [45.523937630646394]
We propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm in model-based reinforcement learning. SFP constructs a novel world model to simulate diverse future states, enabling an "imagination-based" environmental simulation.
arXiv Detail & Related papers (2025-10-05T03:57:38Z)
- Autoregressive End-to-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement [15.002921311530374]
Autoregressive models are a formidable baseline for end-to-end planning in autonomous driving. Their performance, however, is constrained by a temporal misalignment: the planner must condition future actions on past sensory data. We propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame. We also introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation.
arXiv Detail & Related papers (2025-09-25T09:24:45Z)
- Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
- GraphSCENE: On-Demand Critical Scenario Generation for Autonomous Vehicles in Simulation [11.896059467313668]
This work introduces a novel method that generates dynamic temporal scene graphs corresponding to diverse traffic scenarios, on demand and tailored to user-defined preferences. A temporal Graph Neural Network (GNN) model learns to predict relationships between the ego vehicle, other agents, and static structures, guided by real-world interaction patterns. We render the predicted scenarios in simulation to further demonstrate their effectiveness as testing environments for AV agents.
arXiv Detail & Related papers (2024-10-17T13:02:06Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Planning with Adaptive World Models for Autonomous Driving [50.4439896514353]
We present nuPlan, a real-world motion planning benchmark that captures multi-agent interactions. We learn to model such unique behaviors with BehaviorNet, a graph convolutional neural network (GCNN). We also present AdaptiveDriver, a model-predictive control (MPC) based planner that unrolls different world models conditioned on BehaviorNet's predictions (see the MPC sketch after this list).
arXiv Detail & Related papers (2024-06-15T18:53:45Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Learning Robust Policies for Generalized Debris Capture with an Automated Tether-Net System [2.0429716172112617]
This paper presents a reinforcement learning framework that integrates a policy optimization approach with net dynamics simulations.
A state transition model is considered in order to incorporate synthetic uncertainties in state estimation and launch actuation.
The trained policy demonstrates capture performance close to that obtained with reliability-based optimization run over an individual scenario.
arXiv Detail & Related papers (2022-01-11T20:09:05Z)
- Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization [60.73540999409032]
We show that expressive autoregressive dynamics models generate each dimension of the next state and reward sequentially, conditioned on the previously generated dimensions (a minimal sketch appears after this list).
We also show that autoregressive dynamics models are useful for offline policy optimization, serving to enrich the replay buffer.
arXiv Detail & Related papers (2021-04-28T16:48:44Z)
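For the Planning with Adaptive World Models entry above, the sketch below shows one plausible reading of a behavior-conditioned MPC planner: a behavior predictor selects which world model to unroll, and a random-shooting search keeps the cheapest rollout. The names (`behavior_net`, `world_models`, `cost_fn`) are illustrative assumptions, not the authors' code.

```python
# Hypothetical random-shooting MPC in the spirit of AdaptiveDriver: a behavior
# predictor picks which world model the planner unrolls.
import numpy as np

def mpc_plan(state, behavior_net, world_models, cost_fn,
             horizon=8, num_samples=64, action_dim=2, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mode = behavior_net(state)        # predicted behavior mode of nearby agents
    model = world_models[mode]        # world model matched to that mode
    best_seq, best_cost = None, np.inf
    for _ in range(num_samples):
        actions = rng.normal(size=(horizon, action_dim))  # candidate sequence
        s, cost = state, 0.0
        for a in actions:
            s = model(s, a)           # one-step rollout under the chosen model
            cost += cost_fn(s, a)
        if cost < best_cost:
            best_seq, best_cost = actions, cost
    return best_seq[0]                # receding horizon: execute first action
```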
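For the Autoregressive Dynamics Models entry, the factorization described (each next-state dimension and the reward generated in turn, conditioned on the dimensions produced so far) corresponds to p(s', r | s, a) = prod_i p(y_i | s, a, y_1..y_{i-1}). The toy module below illustrates the idea; the architecture is an assumption for illustration, not the paper's.

```python
# Toy autoregressive dynamics model: head i predicts output dimension i from
# (state, action, previously generated dimensions). Illustrative sketch only.
import torch
import torch.nn as nn

class ARDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        # One head per next-state dimension, plus one for the reward.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim + action_dim + i, hidden),
                          nn.ReLU(), nn.Linear(hidden, 1))
            for i in range(state_dim + 1)
        ])

    def forward(self, s, a):
        generated = []
        for head in self.heads:
            ctx = torch.cat([s, a] + generated, dim=-1)  # condition on y_{<i}
            generated.append(head(ctx))                  # predict dimension i
        next_state = torch.cat(generated[:-1], dim=-1)
        reward = generated[-1]
        return next_state, reward
```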