Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs
- URL: http://arxiv.org/abs/2601.08470v1
- Date: Tue, 13 Jan 2026 11:55:31 GMT
- Title: Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs
- Authors: Takara Taniguchi, Kuniaki Saito, Atsushi Hashimoto,
- Abstract summary: Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems. Current benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. We introduce HazardForge, a scalable pipeline that leverages image editing models to generate such scenarios with layout decision algorithms and validation modules.
- Score: 10.48956192789531
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce HazardForge, a scalable pipeline that leverages image editing models to generate these scenarios using layout decision algorithms and validation modules. Using HazardForge, we construct MovSafeBench, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
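The MCQ evaluation protocol described in the abstract can be sketched as follows; the item schema and the `predict_choice` stub are assumptions for illustration, not the benchmark's actual interface:

```python
# Hypothetical sketch of scoring an MCQ safety benchmark like MovSafeBench:
# each item pairs an image and question with answer options and a gold
# choice; accuracy is the fraction of items the model answers correctly.
from dataclasses import dataclass

@dataclass
class MCQItem:
    image_path: str
    question: str
    options: list  # answer texts, e.g. ["pedestrian", "drifting couch", ...]
    gold: int      # index of the correct option

def predict_choice(item: MCQItem) -> int:
    """Placeholder for a real VLM inference call; returns an option index."""
    return 0  # a real implementation would query the model with image + question

def accuracy(items: list) -> float:
    """Fraction of items where the predicted index matches the gold index."""
    if not items:
        return 0.0
    correct = sum(predict_choice(it) == it.gold for it in items)
    return correct / len(items)

items = [
    MCQItem("img_001.png", "Which object is moving toward the ego vehicle?",
            ["pedestrian", "drifting couch", "parked car", "none"], 1),
    MCQItem("img_002.png", "Is the distant object on a collision course?",
            ["yes", "no"], 0),
]
print(f"accuracy = {accuracy(items):.2f}")
```

Comparing such accuracies between normal-object and anomalous-object subsets is one way to quantify the degradation the abstract reports.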
Related papers
- WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving [33.850069933308994]
High-level reasoning in safety-critical scenarios remains a major challenge.
We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge.
We introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios.
arXiv Detail & Related papers (2025-11-25T07:47:27Z)
- Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs [30.363301425068162]
We present WM-MoE, the first world model-based motion forecasting framework.
It unifies perception, temporal memory, and decision making to address the challenges of high-risk corner-case scenarios.
WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and data-missing conditions.
arXiv Detail & Related papers (2025-10-23T11:41:51Z)
- DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution.
No existing benchmark has systematically addressed over-refusal in the visual modality.
This setting introduces unique challenges, such as dual-use cases where an instruction is harmless but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
- AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond [101.20320617562321]
AccidentBench is a large-scale benchmark that combines vehicle accident scenarios with "Beyond" domains.
The benchmark contains approximately 2,000 videos and over 19,000 human-annotated question-answer pairs.
arXiv Detail & Related papers (2025-09-30T17:59:13Z)
- Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce AutoSteer, a modular and adaptive inference-time intervention technology that requires no fine-tuning of the underlying model.
AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
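The prober-plus-refusal idea can be illustrated with a toy sketch (not the AutoSteer authors' code); the scoring function, threshold, and all names here are hypothetical stand-ins:

```python
# Toy sketch of inference-time safety steering: a prober scores an
# intermediate representation for risk, and a refusal path intervenes
# only when the score crosses a threshold; otherwise decoding proceeds.
def safety_prober(hidden_state: list) -> float:
    """Hypothetical risk estimate: mean activation clamped to [0, 1]."""
    m = sum(hidden_state) / len(hidden_state)
    return max(0.0, min(1.0, m))

def generate_with_steering(hidden_state, decode, threshold=0.5):
    """Decode normally unless the prober flags a safety risk."""
    risk = safety_prober(hidden_state)
    if risk >= threshold:
        return "I can't help with that."  # refusal head intervenes
    return decode(hidden_state)           # normal generation path

out = generate_with_steering([0.1, 0.2, 0.1], lambda h: "benign answer")
print(out)
```

The key design point the abstract highlights is that the intervention is gated and selective, so benign requests pass through the normal decoding path untouched.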
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
- Embodied Scene Understanding for Vision Language Models via MetaVQA [42.70816811661304]
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications.
We present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics.
Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations.
arXiv Detail & Related papers (2025-01-15T21:36:19Z)
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.
It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL).
It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
- Realistic Corner Case Generation for Autonomous Vehicles with Multimodal Large Language Model [10.741225574706]
AutoScenario is a framework for realistic corner case generation.
It converts safety-critical real-world data from multiple sources into textual representations.
It integrates tools from the Simulation of Urban Mobility (SUMO) and CARLA simulators.
arXiv Detail & Related papers (2024-11-29T20:23:28Z)
- ADUGS-VINS: Generalized Visual-Inertial Odometry for Robust Navigation in Highly Dynamic and Complex Environments [7.07379964916809]
We introduce ADUGS-VINS, which integrates an enhanced SORT algorithm along with a promptable foundation model into VIO.
We evaluate our proposed method using multiple public datasets representing various scenes, as well as in a real-world scenario involving diverse dynamic objects.
arXiv Detail & Related papers (2024-11-28T17:41:33Z)
- CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions [13.981748780317329]
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs).
This study introduces a novel accident anticipation framework for AVs, termed CRASH.
It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion.
Our model surpasses existing top baselines on critical evaluation metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA).
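The five-stage pipeline listed above can be sketched as a simple composition; every function body here is a hypothetical placeholder for illustration, not the CRASH authors' implementation:

```python
# Illustrative composition of a five-stage accident-anticipation pipeline:
# object detector -> feature extractor -> object-aware and context-aware
# modules -> multi-layer fusion producing a scalar accident-risk score.
def object_detector(frame):
    """Placeholder detector: returns toy detections for any frame."""
    return [{"box": (10, 10, 50, 50), "label": "car"}]

def feature_extractor(frame, detections):
    """Placeholder per-object features (here, just label lengths)."""
    return [len(d["label"]) for d in detections]

def object_aware(features):
    """Placeholder object-level risk cue: sum of toy features."""
    return sum(features)

def context_aware(frame):
    """Placeholder scene-level risk cue."""
    return 1.0

def multi_layer_fusion(obj_score, ctx_score):
    """Placeholder fusion: equal-weight average of the two cues."""
    return 0.5 * obj_score + 0.5 * ctx_score

def anticipate(frame):
    """Run the full pipeline on one frame and return a fused risk score."""
    dets = object_detector(frame)
    feats = feature_extractor(frame, dets)
    return multi_layer_fusion(object_aware(feats), context_aware(frame))

risk = anticipate(frame="dummy_frame")
print(risk)
```

In a real system, thresholding this per-frame score over time yields the anticipation signal that metrics like AP and mTTA evaluate.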
arXiv Detail & Related papers (2024-07-25T04:12:49Z)
- Towards Evaluating the Robustness of Visual State Space Models [63.14954591606638]
Vision State Space Models (VSSMs) have demonstrated remarkable performance in visual perception tasks.
However, their robustness under natural and adversarial perturbations remains a critical concern.
We present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios.
arXiv Detail & Related papers (2024-06-13T17:59:44Z)
- HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments [93.94020724735199]
HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind.
This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines.
arXiv Detail & Related papers (2024-01-23T18:59:43Z)
- SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries [94.84458417662407]
We introduce SAFE-SIM, a controllable closed-loop safety-critical simulation framework.
Our approach yields two distinct advantages: 1) generating realistic long-tail safety-critical scenarios that closely reflect real-world conditions, and 2) providing controllable adversarial behavior for more comprehensive and interactive evaluations.
We validate our framework empirically using the nuScenes and nuPlan datasets across multiple planners, demonstrating improvements in both realism and controllability.
arXiv Detail & Related papers (2023-12-31T04:14:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.