Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving
- URL: http://arxiv.org/abs/2508.02028v1
- Date: Mon, 04 Aug 2025 03:43:23 GMT
- Title: Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving
- Authors: Tianyuan Zhang, Ting Jin, Lu Wang, Jiangfan Liu, Siyuan Liang, Mingchuan Zhang, Aishan Liu, Xianglong Liu,
- Abstract summary: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD)<n>Current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs.<n>We introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms.
- Score: 28.224158941451087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.
Related papers
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving.<n> evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z) - VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion [3.981488921870863]
This paper proposes a safety-critical testing scenario generation framework.<n>It integrates the high-level semantic understanding capabilities of Vision Language Models (VLMs) with the fine-grained generation capabilities of adaptive guided diffusion models.<n> Experimental results demonstrate that the proposed method can efficiently generate realistic, diverse, and highly interactive safety-critical testing scenarios.
arXiv Detail & Related papers (2025-12-02T14:56:57Z) - Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving [54.46325690390831]
We propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment.<n>MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine.<n>MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes.
arXiv Detail & Related papers (2025-11-26T17:01:41Z) - EnvTrace: Simulation-Based Semantic Evaluation of LLM Code via Execution Trace Alignment -- Demonstrated at Synchrotron Beamlines [0.19309412407089965]
EnvTrace is a simulation-based method that evaluates execution traces to assess semantic code equivalence.<n>It is demonstrated with a beamline control-logic digital twin to facilitate the evaluation of instrument control code.
arXiv Detail & Related papers (2025-11-13T04:52:01Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion.<n>Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient.<n>Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z) - Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving [15.903491909277745]
We present a comprehensive evaluation of five key LLM modules.<n>We demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation.<n>In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios.
arXiv Detail & Related papers (2025-09-02T19:02:49Z) - MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving [85.04826012938642]
MetAdv is a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation.<n>It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments.<n>It enables real-time capture of physiological signals and behavioral feedback from drivers.
arXiv Detail & Related papers (2025-08-04T03:07:54Z) - VLMPlanner: Integrating Visual Language Models with Motion Planning [18.633637485218802]
VLMPlanner is a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images.<n>We develop the Context-Adaptive Inference Gate mechanism that enables the VLM to mimic human driving behavior.
arXiv Detail & Related papers (2025-07-27T16:15:21Z) - Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation [0.0]
We introduce a unified agentic framework that leverages large language models (LLMs) for both discrete fault-recovery planning and continuous process control.<n>Our results demonstrate that, with structured feedback and modular agents, LLMs can unify high-level symbolic planningand low-level continuous control.
arXiv Detail & Related papers (2025-07-03T11:20:22Z) - EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM [8.3321872381107]
We introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM.<n>Unlike existing methods, EMAC+ dynamically refines high-level textual plans using real-time feedback from a VLM executing low-level visual control tasks.<n>EMAC+ achieves superior task performance, against noisy observations, and efficient learning.
arXiv Detail & Related papers (2025-05-26T12:34:16Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning.<n>We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies.<n>We also propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z) - MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents.<n>MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios.<n>Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z) - LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [94.84458417662404]
LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios.<n>By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors.<n>LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
arXiv Detail & Related papers (2025-04-15T17:14:06Z) - HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to endow robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments.<n>Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction.<n>We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z) - Perceptual Motor Learning with Active Inference Framework for Robust Lateral Control [0.5437298646956507]
This paper presents a novel Perceptual Motor Learning framework integrated with Active Inference (AIF) to enhance lateral control in Highly Automated Vehicles (HAVs)<n>PML emphasizes the seamless integration of perception and action, enabling efficient decision-making in dynamic environments.<n>Our approach unifies deep learning with active inference principles, allowing HAVs to perform lane-keeping with minimal data and without extensive retraining across different environments.
arXiv Detail & Related papers (2025-03-03T15:49:18Z) - AD-H: Autonomous Driving with Hierarchical Agents [64.49185157446297]
We propose to connect high-level instructions and low-level control signals with mid-level language-driven commands.
We implement this idea through a hierarchical multi-agent driving system named AD-H.
arXiv Detail & Related papers (2024-06-05T17:25:46Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z) - Variational Latent Branching Model for Off-Policy Evaluation [23.073461349048834]
We propose a variational latent branching model (VLBM) to learn the transition function of Markov decision processes (MDPs)
We introduce the branching architecture to improve the model's robustness against randomly model weights.
We show that the VLBM outperforms existing state-of-the-art OPE methods in general.
arXiv Detail & Related papers (2023-01-28T02:20:03Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with
Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDP)
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.