Related papers: SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding

URL: http://arxiv.org/abs/2508.06763v3
Date: Thu, 30 Oct 2025 23:11:01 GMT
Title: SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Authors: Zihao Sheng, Zilin Huang, Yansong Qu, Jiancong Chen, Yuhao Luo, Yen-Jung Chen, Yue Leng, Sikai Chen
Abstract summary: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks. We propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. We curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries.
Score: 13.157191393518595
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG

Related papers

Vision-Language Feature Alignment for Road Anomaly Segmentation [38.2615882515309]
We propose a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories. At inference time, we introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence.
arXiv Detail & Related papers (2026-03-01T10:17:00Z)
Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives [61.64550292163646]
Continual learning in visual understanding aims to deal with catastrophic forgetting in Multimodal Large Language Models (MLLMs). We construct a multimodal visual understanding dataset (MSVQA) encompassing four different scenarios and perspectives. We propose mUltimodal coNtInual learning with MLLMs From multi-scenarIo pERspectives (UNIFIER) to address visual discrepancies while learning different scenarios.
arXiv Detail & Related papers (2025-11-23T15:47:49Z)
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic [8.754321713184483]
We propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning. We also introduce an innovative knowledge module fusing Chain-of-Thought reasoning with Retrieval-Augmented Generation (RAG).
arXiv Detail & Related papers (2025-09-14T08:53:06Z)
Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding [5.830619388189558]
This paper introduces a multi-agent framework for comprehensive highway scene understanding. A large generic vision-language model (VLM) is contextualized with domain knowledge to generate task-specific chain-of-thought prompts. The framework simultaneously addresses weather classification, pavement wetness assessment, and traffic congestion detection.
arXiv Detail & Related papers (2025-08-24T03:55:24Z)
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z)
SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic. Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z)
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis [6.213279061986497]
SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach. Our framework employs a multimodal-based aggregation strategy to handle videos of various lengths and generate structured responses for review and evaluation. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
arXiv Detail & Related papers (2025-01-17T23:35:34Z)
MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios [10.353093987945012]
Multimodal large language models (MLLMs) have shown satisfactory effects in many autonomous driving tasks. In this paper, MLLMs are utilized to solve joint semantic scene understanding and risk localization tasks. Our method achieves 80.1% BLEU-1 score and 298.5% CIDEr score in the scene understanding task, and 59.6% accuracy in the localization task.
arXiv Detail & Related papers (2024-12-27T02:05:38Z)
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [65.04643267731122]
General MLLMs combined with CLIP often struggle to represent driving-specific scenarios accurately. We propose the Hints of Prompt (HoP) framework, which introduces three key enhancements. These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning.
arXiv Detail & Related papers (2024-11-20T06:58:33Z)
Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events [5.233512464561313]
Multimodal Large Language Models (MLLMs) offer a novel approach by integrating textual, visual, and audio modalities. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts. Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis.
arXiv Detail & Related papers (2024-06-19T23:50:41Z)
AccidentBlip: Agent of Accident Warning based on MA-former [24.81148840857782]
AccidentBlip is a vision-only framework that employs our self-designed Motion Accident Transformer (MA-former) to process each frame of video. AccidentBlip achieves performance in both accident detection and prediction tasks on the DeepAccident dataset. It also outperforms current SOTA methods in V2V and V2X scenarios, demonstrating a superior capability to understand complex real-world environments.
arXiv Detail & Related papers (2024-04-18T12:54:25Z)
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems. These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions. We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
arXiv Detail & Related papers (2024-02-13T18:39:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.