Related papers: VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

URL: http://arxiv.org/abs/2601.12672v1
Date: Mon, 19 Jan 2026 02:34:33 GMT
Title: VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness
Authors: Qimao Chen, Fang Li, Shaoqing Xu, Zhiyi Lai, Zixun Xie, Yuechen Luo, Shengyin Jiang, Hanbing Li, Long Chen, Bing Wang, Yi Zhang, Zhi-Xin Yang,
Abstract summary: We introduce VILTA, a framework that integrates a Vision Language Models (VLMs) into the closed-loop training of autonomous driving (AD) agents.<n>Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment.<n>We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy.
Score: 16.269781291801667
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions including safety-critical scenario generation and closed-loop learning often rely on rule-based heuristics, resampling methods and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.

Related papers

dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving.<n> evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
Enhancing End-to-End Autonomous Driving with Risk Semantic Distillaion from VLM [14.016225216093643]
We introduce Risk Semantic Distillation (RSD), a novel framework that leverages Vision-Language Models (VLMs) to enhance the training of End-to-End (E2E) autonomous driving backbones.<n>Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features.<n>Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions.
arXiv Detail & Related papers (2025-11-18T13:46:18Z)
Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion.<n>Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient.<n>Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding.<n>We propose a novel vision-language framework tailored for autonomous driving, called LMAD.<n>Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving [64.12414815634847]
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge.<n>We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
arXiv Detail & Related papers (2025-08-15T12:06:55Z)
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving.<n>We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers.<n>We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models [1.6612510324510592]
CurricuVLM is a novel framework that enables personalized curriculum learning for autonomous driving agents.<n>Our approach exploits Vision-Language Models (VLMs) to analyze agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios.<n>CurricuVLM outperforms state-of-the-art baselines across both regular and safety-critical scenarios.
arXiv Detail & Related papers (2025-02-21T00:42:40Z)
Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems.<n>We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting semantics.<n>We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z)
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [17.36342349850825]
Vision-language models (VLMs) as teachers to enhance training by providing additional supervision.<n>VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems. LLMs are intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning. We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
arXiv Detail & Related papers (2023-11-28T03:13:09Z)
Continual Driving Policy Optimization with Closed-Loop Individualized Curricula [2.903150959383393]
We develop a continuous driving policy optimization framework featuring Closed-Loop Individualized Curricula (CLIC) CLIC frames AV Evaluation as a collision prediction task, where it estimates the chance of AV failures in these scenarios at each iteration. We show that CLIC surpasses other curriculum-based training strategies, showing substantial improvement in managing risky scenarios.
arXiv Detail & Related papers (2023-09-25T15:14:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.