Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
- URL: http://arxiv.org/abs/2509.21143v2
- Date: Sat, 27 Sep 2025 15:53:51 GMT
- Title: Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
- Authors: Junfeng Yan, Biao Wu, Meng Fang, Ling Chen
- Abstract summary: In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. We introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. We propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms.
- Score: 37.95018030319752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
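As a concrete illustration of the "parameterized tasks with precise programmatic checks" described above, the following is a minimal sketch of what one task definition could look like. All names here (VehicleTask, GuiState, make_ac_task) and the key/value layout of the GUI state are assumptions for illustration, not the released Automotive-ENV API: a template is instantiated with sampled parameters, and success is decided by a deterministic predicate over the final interface state rather than by visual inspection.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical sketch: VehicleTask, GuiState, and make_ac_task are
# illustrative names, not the released Automotive-ENV API.

GuiState = Dict[str, Any]  # e.g. {"ac.temp": 22.0, "nav.dest": "home"}

@dataclass
class VehicleTask:
    """A parameterized task instance with a programmatic success check."""
    instruction: str                   # natural-language goal shown to the agent
    params: Dict[str, Any]             # sampled parameters for this instance
    check: Callable[[GuiState], bool]  # deterministic pass/fail predicate

def make_ac_task(target_temp: float) -> VehicleTask:
    """Instantiate the 'set cabin temperature' template for one target value."""
    return VehicleTask(
        instruction=f"Set the cabin temperature to {target_temp:.0f} degrees.",
        params={"target_temp": target_temp},
        check=lambda s: abs(s.get("ac.temp", float("nan")) - target_temp) < 0.5,
    )

task = make_ac_task(22.0)
final_state: GuiState = {"ac.temp": 22.0, "ac.fan": 3}  # GUI state after the agent acts
print(task.instruction, "->", "pass" if task.check(final_state) else "fail")
```

A check of this form is what makes the evaluation reproducible: two runs that reach the same interface state always receive the same verdict, independent of how the agent got there.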
Related papers
- Multimodal Large Language Model Framework for Safe and Interpretable Grid-Integrated EVs [3.7098231493739764]
This paper presents a multimodal large language model (LLM)-based framework to process multimodal sensor data. The framework is validated using real-world data collected from instrumented vehicles driving on urban roads. By combining visual perception (YOLOv8), geocoded positioning, and CAN bus telemetry, the framework bridges raw sensor data and driver comprehension.
arXiv Detail & Related papers (2025-10-02T21:50:31Z)
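The abstract names the three input streams but not the prompt format. The following is a hedged sketch of how YOLOv8-style detections, a geocoded position, and CAN bus telemetry could be serialized into one textual context for an LLM; the Snapshot fields and the serialization layout are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative only: detection labels stand in for YOLOv8 output, and the
# Snapshot fields and prompt layout are assumptions, not the paper's schema.

@dataclass
class Snapshot:
    detections: List[Tuple[str, float]]  # (class label, confidence) from the vision module
    lat: float
    lon: float
    speed_kph: float                     # from CAN bus telemetry
    brake_pressure_bar: float

def to_llm_context(s: Snapshot) -> str:
    """Serialize the three sensor streams into one textual context window."""
    objs = ", ".join(f"{c} ({p:.2f})" for c, p in s.detections) or "none"
    return (
        f"Detected objects: {objs}. "
        f"Position: ({s.lat:.5f}, {s.lon:.5f}). "
        f"Telemetry: speed {s.speed_kph:.0f} km/h, brake pressure {s.brake_pressure_bar:.1f} bar."
    )

snap = Snapshot([("pedestrian", 0.91), ("traffic_light", 0.88)], 51.50720, -0.12760, 32.0, 0.4)
print(to_llm_context(snap))  # text a driver-facing LLM can reason over
```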
- VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction [78.34534983766973]
VehicleWorld is the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties. We propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions.
arXiv Detail & Related papers (2025-09-08T14:28:25Z)
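SFC is described only at the level of "explicit system state awareness" and "direct state transitions", so the sketch below is one possible reading, with hypothetical module/property names rather than VehicleWorld's actual schema: the controller holds the full device state and applies only the property changes a target condition requires, instead of issuing blind API calls.

```python
from typing import Any, Dict

# Minimal interpretation of State-based Function Call (SFC): keep explicit,
# inspectable device state and transition directly toward a target condition.
# Module/property names are hypothetical, not VehicleWorld's actual schema.

class SFCController:
    def __init__(self, state: Dict[str, Any]):
        self.state = dict(state)  # explicit system state

    def transition(self, target: Dict[str, Any]) -> Dict[str, Any]:
        """Compute and apply only the property changes the target requires."""
        delta = {k: v for k, v in target.items() if self.state.get(k) != v}
        self.state.update(delta)
        return delta              # what actually changed, useful for logging or undo

ctrl = SFCController({"window.driver": "closed", "ac.power": "off", "ac.temp": 26})
changed = ctrl.transition({"ac.power": "on", "ac.temp": 22, "window.driver": "closed"})
print(changed)  # {'ac.power': 'on', 'ac.temp': 22}; the window is left untouched
```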
- LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding. We propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
- Progressive Bird's Eye View Perception for Safety-Critical Autonomous Driving: A Comprehensive Survey [20.7823289124196]
Bird's-Eye-View (BEV) perception has become a foundational paradigm in autonomous driving. This survey provides the first comprehensive review of BEV perception from a safety-critical perspective.
arXiv Detail & Related papers (2025-08-11T02:40:46Z)
- Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving [10.423977886893278]
We present SCD-Bench, a framework specifically designed to assess the safety cognition capabilities of vision-language models (VLMs) in autonomous driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving ), a semi-automated labeling system. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task.
arXiv Detail & Related papers (2025-03-09T07:53:19Z)
- SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic. Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z)
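The abstract says traffic rules are translated into first-order logic but does not reproduce the rule language. The sketch below illustrates the general idea with a single universally quantified rule over a scene, using hypothetical predicates rather than SafeAuto's actual encoding.

```python
from dataclasses import dataclass
from typing import List

# Sketch of one traffic rule as first-order logic over a scene; predicates
# and fields are illustrative, not SafeAuto's actual rule set.
# Rule: forall v . red_light(v) and approaching(v) -> stopped(v)

@dataclass
class Vehicle:
    vid: str
    facing_red_light: bool
    distance_to_line_m: float
    speed_kph: float

def red_light(v: Vehicle) -> bool:
    return v.facing_red_light

def approaching(v: Vehicle) -> bool:
    return 0.0 < v.distance_to_line_m < 30.0

def stopped(v: Vehicle) -> bool:
    return v.speed_kph == 0.0

def rule_holds(scene: List[Vehicle]) -> bool:
    """Evaluate the universally quantified rule over every vehicle in the scene."""
    return all(stopped(v) for v in scene if red_light(v) and approaching(v))

scene = [Vehicle("ego", True, 12.0, 0.0), Vehicle("other", False, 5.0, 40.0)]
print(rule_holds(scene))  # True: the only vehicle facing a red light is stopped
```

Making the rule explicit in this form is what allows a checker to report which premise failed, rather than returning an opaque unsafe/safe label.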
- Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [65.04643267731122]
General MLLMs combined with CLIP often struggle to represent driving-specific scenarios accurately. We propose the Hints of Prompt (HoP) framework, which introduces three key enhancements. These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning.
arXiv Detail & Related papers (2024-11-20T06:58:33Z)
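Since the abstract names a Hint Fusion module without specifying it, here is a minimal PyTorch-style sketch under the assumption that each hint type arrives as a sequence of token embeddings and fusion is concatenation followed by a shared projection; the actual HoP module may differ substantially.

```python
import torch
import torch.nn as nn

# Minimal sketch of a hint-fusion step. Fusion-by-concatenation is an
# assumption for illustration, not the paper's actual HoP design.

class HintFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # shared projection after concatenation

    def forward(self, visual: torch.Tensor, *hints: torch.Tensor) -> torch.Tensor:
        # Append hint tokens to the visual token sequence (shape B x T x D),
        # then project so downstream layers see one enriched sequence.
        fused = torch.cat((visual, *hints), dim=1)
        return self.proj(fused)

fusion = HintFusion(dim=256)
visual = torch.randn(2, 50, 256)                         # visual tokens
h1, h2, h3 = (torch.randn(2, 8, 256) for _ in range(3))  # three hint streams
out = fusion(visual, h1, h2, h3)
print(out.shape)  # torch.Size([2, 74, 256])
```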
- Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction [69.29802752614677]
RouteFormer is a novel ego-trajectory prediction network combining GPS data, environmental context, and the driver's field-of-view. To tackle data scarcity and enhance diversity, we introduce GEM, a dataset of urban driving scenarios enriched with synchronized driver field-of-view and gaze data.
arXiv Detail & Related papers (2023-12-13T23:06:30Z)
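The abstract lists the inputs (GPS track, environmental context, driver field-of-view) but not the architecture. As a hedged illustration of fusing such streams, the sketch below encodes a past GPS track with a GRU and combines it with a field-of-view embedding to regress future offsets; every layer and dimension choice here is an assumption, not RouteFormer's actual design.

```python
import torch
import torch.nn as nn

# Sketch of ego-trajectory prediction from a past GPS track plus a driver
# field-of-view feature. All architecture choices are illustrative.

class EgoPredictor(nn.Module):
    def __init__(self, hidden: int = 64, fov_dim: int = 32, horizon: int = 12):
        super().__init__()
        self.track_enc = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden + fov_dim, horizon * 2)  # future (x, y) offsets
        self.horizon = horizon

    def forward(self, track: torch.Tensor, fov: torch.Tensor) -> torch.Tensor:
        _, h = self.track_enc(track)             # h: 1 x B x hidden
        feat = torch.cat((h[-1], fov), dim=-1)   # fuse motion and gaze context
        return self.head(feat).view(-1, self.horizon, 2)

model = EgoPredictor()
past = torch.randn(4, 20, 2)    # 20 past GPS positions per sample
gaze = torch.randn(4, 32)       # driver field-of-view / gaze embedding
print(model(past, gaze).shape)  # torch.Size([4, 12, 2])
```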
- Assessing Drivers' Situation Awareness in Semi-Autonomous Vehicles: ASP based Characterisations of Driving Dynamics for Modelling Scene Interpretation and Projection [0.0]
We present a framework to assess the driver's awareness of the situation and to provide human-centred assistance. The framework is developed as a modular system within the Robot Operating System (ROS), with modules for sensing the environment and the driver state. A particular focus of this paper is an Answer Set Programming (ASP)-based approach for modelling and reasoning about the driver's interpretation and projection of the scene.
arXiv Detail & Related papers (2023-08-30T09:07:49Z)
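The ASP programs themselves are not shown in the abstract. As a rough Python approximation of their flavor, the sketch below treats facts as tuples and applies one derivation rule to a fixpoint, the way a grounder/solver saturates a program; the predicate names are illustrative, not the paper's encoding.

```python
# Rough Python approximation of ASP-style derivation: facts are tuples and
# one rule is applied until a fixpoint. Predicates are illustrative only.

facts = {
    ("sees", "driver", "pedestrian"),
    ("at", "pedestrian", "crossing"),
}

def derive(known: set) -> set:
    """One rule: aware(D, X) :- sees(D, X), at(X, crossing)."""
    new = set(known)
    for pred, a, b in known:
        if pred == "sees" and ("at", b, "crossing") in known:
            new.add(("aware", a, b))
    return new

prev, cur = None, facts
while cur != prev:  # iterate until no rule adds a new fact
    prev, cur = cur, derive(cur)

print(("aware", "driver", "pedestrian") in cur)  # True
```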
- AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception [26.84439405241999]
We present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle. AIDE facilitates holistic driver monitoring through three distinctive characteristics. Two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations.
arXiv Detail & Related papers (2023-07-26T03:12:05Z)
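The abstract does not say which two fusion strategies AIDE introduces, so the sketch below shows the two standard alternatives for multi-stream representations, feature-level (early) fusion and decision-level (late) fusion, purely as background for the comparison such a benchmark enables.

```python
import torch
import torch.nn as nn

# Background sketch, not AIDE's actual strategies: the two standard options
# for fusing multi-stream features (e.g. in-cabin and out-of-cabin views).

class EarlyFusion(nn.Module):
    def __init__(self, dims=(128, 128), n_classes=4):
        super().__init__()
        self.clf = nn.Linear(sum(dims), n_classes)  # one head over joined features

    def forward(self, in_cabin, out_cabin):
        return self.clf(torch.cat((in_cabin, out_cabin), dim=-1))

class LateFusion(nn.Module):
    def __init__(self, dims=(128, 128), n_classes=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, n_classes) for d in dims)

    def forward(self, in_cabin, out_cabin):
        logits = [h(x) for h, x in zip(self.heads, (in_cabin, out_cabin))]
        return torch.stack(logits).mean(dim=0)      # average per-stream decisions

x_in, x_out = torch.randn(8, 128), torch.randn(8, 128)
print(EarlyFusion()(x_in, x_out).shape, LateFusion()(x_in, x_out).shape)
```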