BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems
- URL: http://arxiv.org/abs/2507.17722v1
- Date: Wed, 23 Jul 2025 17:32:17 GMT
- Title: BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems
- Authors: Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
- Abstract summary: Large language models (LLMs) are increasingly extended to process multimodal data such as text and video simultaneously. When unrestricted, LLMs and vision language models (VLMs) show impressive performance in describing even complex traffic situations. However, VLMs are prone to hallucination: they may either fail to see traffic agents who are present in a situation, or see traffic agents who are not there in reality.
- Score: 1.701722696403793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly extended to process multimodal data such as text and video simultaneously. Their remarkable performance in understanding what is shown in images surpasses that of specialized neural networks (NNs) such as YOLO, which support only a well-defined but very limited vocabulary, i.e., the objects they are able to detect. When unrestricted, LLMs and in particular state-of-the-art vision language models (VLMs) show impressive performance in describing even complex traffic situations. This makes them potentially suitable components for automotive perception systems that need to understand complex traffic situations or edge cases. However, LLMs and VLMs are prone to hallucination: they may either fail to see traffic agents, such as vulnerable road users, who are present in a situation, or see traffic agents who are not there in reality. While the latter is unwanted, causing an ADAS or autonomous driving system (ADS) to slow down unnecessarily, the former could lead to disastrous decisions by an ADS. In our work, we systematically assess the performance of 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to support safety guardrails for capturing such hallucinations in VLM-supported perception systems. We observe that both proprietary and open VLMs exhibit remarkable image understanding capabilities, even paying thorough attention to fine details that are sometimes difficult for humans to spot. However, they are also still prone to making up elements in their descriptions, which to date requires hallucination detection strategies such as BetterCheck, which we propose in our work.
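The guardrail setting described in the abstract can be illustrated with a sampling-based consistency check, a common pattern for catching VLM hallucinations: describe the same frame several times, extract the traffic agents each description mentions, and only trust agents with sufficient cross-sample support. The Python sketch below is illustrative only; `query_vlm`, the agent vocabulary, and the 0.5 support threshold are assumptions, not the paper's actual BetterCheck procedure.

```python
from collections import Counter

# Hypothetical stand-in for a real VLM API call; returns a free-text
# description of the given image.
def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("plug in a real VLM client here")

# Small agent vocabulary for illustration; a production guardrail would
# use a richer taxonomy (e.g., the Waymo Open Dataset object classes).
AGENT_TERMS = {"pedestrian", "cyclist", "car", "truck", "bus", "motorcyclist"}

def extract_agents(description: str) -> set:
    """Naive keyword spotting over a fixed agent vocabulary."""
    text = description.lower()
    return {term for term in AGENT_TERMS if term in text}

def consistent_agents(image_path: str, n_samples: int = 5,
                      min_support: float = 0.5) -> dict:
    """Sample several descriptions and keep agents mentioned in at least
    `min_support` of them; low-support agents are hallucination candidates."""
    counts = Counter()
    for _ in range(n_samples):
        desc = query_vlm(image_path, "Describe all traffic agents in this scene.")
        counts.update(extract_agents(desc))
    support = {agent: c / n_samples for agent, c in counts.items()}
    return {a: s for a, s in support.items() if s >= min_support}
```

Agents that show up in some samples but fall below the support threshold are exactly the kind of signal a perception guardrail could flag to downstream ADS logic; note that this check catches made-up agents but not agents the VLM consistently misses.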
Related papers
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies.
Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z)
- Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach [45.45569862912077]
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection.
In this study, we construct a dataset, develop models, and investigate the capabilities of LVLMs by evaluating their performance on it.
Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions.
arXiv Detail & Related papers (2025-11-28T16:09:36Z)
- Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception [0.7644902597398215]
We introduce the Distance-Annotated Traffic Perception Question Answering (DTPQA) benchmark, the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes.
We evaluate several state-of-the-art (SOTA) small Vision-Language Models (VLMs) on DTPQA.
arXiv Detail & Related papers (2025-10-09T15:38:41Z)
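A distance-annotated VQA benchmark like DTPQA is typically consumed by an evaluation loop that buckets accuracy by the annotated distance. Below is a minimal Python sketch under assumed data structures; the `Item` fields, `query_vlm`, the lenient string match, and the bucket edges are hypothetical, not DTPQA's actual format or protocol.

```python
from dataclasses import dataclass

# Hypothetical item layout for a distance-annotated VQA benchmark.
@dataclass
class Item:
    image_path: str
    question: str
    answer: str        # ground-truth short answer
    distance_m: float  # distance to the queried object in meters

def query_vlm(image_path: str, question: str) -> str:
    raise NotImplementedError("plug in a real VLM client here")

def bucket(distance_m: float) -> str:
    """Assumed distance buckets; a real benchmark defines its own."""
    if distance_m < 10:
        return "near (<10 m)"
    if distance_m < 30:
        return "mid (10-30 m)"
    return "far (>=30 m)"

def accuracy_by_distance(items: list) -> dict:
    """Accuracy per distance bucket, using a lenient substring match."""
    hits = {}
    for item in items:
        pred = query_vlm(item.image_path, item.question)
        correct = int(item.answer.lower() in pred.lower())
        hits.setdefault(bucket(item.distance_m), []).append(correct)
    return {b: sum(v) / len(v) for b, v in hits.items()}
```

Reporting per-bucket rather than aggregate accuracy makes distance-dependent behavior visible, which is exactly what such a benchmark is designed to probe.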
- LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding.
We propose a novel vision-language framework tailored for autonomous driving, called LMAD.
Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
- Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems.
We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which design components, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z)
- Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation [17.864481047606677]
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning.
They still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications.
We propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs.
arXiv Detail & Related papers (2025-05-22T02:45:45Z)
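Activation-steering methods in this family typically add a fixed direction to a chosen layer's hidden states at inference time, with no training involved. Below is a minimal PyTorch sketch of that general mechanism, assuming a unit-norm SAE-derived direction is already available; the layer choice, scale, and hook placement are illustrative assumptions, not SSL's actual procedure.

```python
import torch

def add_steering_hook(layer: torch.nn.Module,
                      steering_direction: torch.Tensor,
                      scale: float = 4.0):
    """Register a forward hook that shifts a layer's hidden states along
    a fixed (e.g., SAE-derived) direction at inference time."""
    direction = steering_direction / steering_direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return a tuple whose first element is
        # the hidden-states tensor of shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return layer.register_forward_hook(hook)

# Usage sketch (the module path is a placeholder for a real LVLM):
# handle = add_steering_hook(model.language_model.layers[20], direction)
# ...generate...
# handle.remove()
```

Returning the hook handle makes the intervention easy to toggle, which is convenient for comparing steered versus unsteered generations on the same prompts.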
- Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving [55.96227460521096]
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities.
We propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios.
Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving.
arXiv Detail & Related papers (2025-05-09T20:28:17Z)
- Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation [90.00687889213991]
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities.
Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems.
In this paper, we introduce a novel test-time framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
arXiv Detail & Related papers (2025-02-23T20:42:15Z)
- When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis [6.213279061986497]
SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach.
Our framework employs a multimodal-based aggregation strategy to handle videos of various lengths and generate structured responses for review and evaluation.
We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
arXiv Detail & Related papers (2025-01-17T23:35:34Z)
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs).
Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding.
We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z)
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training.
VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
- Evaluating and Enhancing Trustworthiness of LLMs in Perception Tasks [1.701722696403793]
Multimodal Large Language Models (LLMs) are capable of processing text, images, audio, and other data types.
This paper systematically assesses potential hallucination detection strategies for such LLMs.
We evaluate and extend hallucination detection approaches when applying these LLMs to video sequences, using pedestrian detection as an example.
arXiv Detail & Related papers (2024-07-18T20:58:03Z)
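For video input, a per-frame presence check can be hardened with temporal voting: a pedestrian reported in a single frame should usually be corroborated by neighboring frames. Below is a minimal Python sketch of such a sliding-window majority filter; the window size and vote threshold are illustrative assumptions, not the specific strategies evaluated in this paper.

```python
def smooth_detections(per_frame, window: int = 5, min_votes: int = 3):
    """Majority-vote filter over per-frame 'pedestrian present' flags.
    A detection at frame t survives only if enough frames in the window
    centered on t agree, suppressing one-frame hallucinations."""
    n = len(per_frame)
    half = window // 2
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        votes = sum(per_frame[lo:hi])
        needed = min(min_votes, hi - lo)  # relax at clip boundaries
        smoothed.append(votes >= needed)
    return smoothed

# A single spurious detection at frame 3 is filtered out:
flags = [False, False, False, True, False, False, False]
assert smooth_detections(flags) == [False] * 7
```

The same filter also papers over isolated single-frame misses of a genuinely present pedestrian, though sustained misses, the safety-critical case, still require stronger checks.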
- Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models [16.452638202694246]
This work explores the potential of Vision-Language Foundation Models (VLMs) in detecting hard cases in autonomous driving.
We introduce a feasible pipeline where VLMs, fed with sequential image frames and designed prompts, effectively identify challenging agents or scenarios.
We show the effectiveness and feasibility of incorporating our pipeline with state-of-the-art methods on the nuScenes dataset.
arXiv Detail & Related papers (2024-05-31T16:35:41Z)
- Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLMs) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z)
- Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
- Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.05741859030951]
We propose CODA-LM, the first benchmark for the automatic evaluation of LVLMs on self-driving corner cases.
We show that using text-only large language models as judges reveals even better alignment with human preferences than LVLM judges.
Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task.
arXiv Detail & Related papers (2024-04-16T14:20:55Z)
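The text-only judge setup reported here can be approximated with a simple scoring loop: hand an LLM the reference annotation and the candidate answer, and request a bounded score. A minimal Python sketch; `query_llm`, the prompt wording, and the 1-5 scale are assumptions for illustration, not CODA-LM's actual evaluation protocol.

```python
import re

# Hypothetical stand-in for a text-only LLM API call.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM client here")

JUDGE_PROMPT = """You are grading a driving-scene description.
Reference annotation: {reference}
Candidate answer: {candidate}
Rate the candidate's factual agreement with the reference from
1 (contradicts) to 5 (fully consistent). Reply with the number only."""

def judge_score(reference: str, candidate: str) -> int:
    """Ask a text-only LLM judge for a 1-5 consistency score."""
    reply = query_llm(JUDGE_PROMPT.format(reference=reference,
                                          candidate=candidate))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```

Averaging such scores over a corner-case set gives a cheap proxy metric whose usefulness hinges on how well the judge's scores align with human preferences.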
- Learning energy-efficient driving behaviors by imitating experts [75.12960180185105]
This paper examines the role of imitation learning in bridging the gap between control strategies and realistic limitations in communication and sensing.
We show that imitation learning can succeed in deriving policies that, if adopted by 5% of vehicles, may boost the energy-efficiency of networks with varying traffic conditions by 15% using only local observations.
arXiv Detail & Related papers (2022-06-28T17:08:31Z)