RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System
- URL: http://arxiv.org/abs/2511.18286v1
- Date: Sun, 23 Nov 2025 04:40:50 GMT
- Title: RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System
- Authors: Runwei Guan, Rongsheng Hu, Shangshu Chen, Ningyuan Xiao, Xue Xia, Jiayang Liu, Beibei Chen, Ziren Tang, Ningwei Ouyang, Shaofeng Liang, Yuxuan Fan, Wanjie Sun, Yutao Yue
- Abstract summary: RoadSceneVQA is a large-scale visual question answering dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current roadside perception systems mainly focus on instance-level perception, which falls short of enabling interaction via natural language and reasoning about traffic behaviors in context. To bridge this gap, we introduce RoadSceneVQA, a large-scale and richly annotated visual question answering (VQA) dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions, targeting not only object attributes but also the intent, legality, and interaction patterns of traffic participants. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning, grounded in real-world traffic rules and contextual dependencies. To fully exploit the reasoning potential of Multi-modal Large Language Models (MLLMs), we further propose CogniAnchor Fusion (CAF), a vision-language fusion module inspired by human-like scene anchoring mechanisms. Moreover, we propose the Assisted Decoupled Chain-of-Thought (AD-CoT) to enhance reasoned thinking via CoT prompting and multi-task learning. Based on the above, we propose the baseline model RoadMind. Experiments on the RoadSceneVQA and CODA-LM benchmarks show that the pipeline consistently improves both reasoning accuracy and computational efficiency, allowing the MLLM to achieve state-of-the-art performance in structural traffic perception and reasoning tasks.
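The abstract does not detail CAF's internal design; as a rough illustration only, the sketch below shows one plausible form of language-anchored vision fusion, where question tokens query image patch features via cross-attention. All module names, dimensions, and parameters here are hypothetical, not the paper's.

```python
# Hypothetical sketch in the spirit of a "scene anchoring" fusion module.
# Language tokens act as anchors that query the visual feature map.
import torch
import torch.nn as nn

class AnchoredFusion(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, dim=512, heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)   # project image patch features
        self.txt_proj = nn.Linear(txt_dim, dim)   # project question tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_patches, vis_dim); txt_tokens: (B, N_words, txt_dim)
        v = self.vis_proj(vis_tokens)
        q = self.txt_proj(txt_tokens)
        # Each language token attends over the scene, anchoring itself
        # to the image regions relevant to the question.
        fused, _ = self.cross_attn(query=q, key=v, value=v)
        return self.norm(q + fused)  # residual connection

fusion = AnchoredFusion()
out = fusion(torch.randn(2, 196, 1024), torch.randn(2, 12, 768))
print(out.shape)  # torch.Size([2, 12, 512])
```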
Related papers
- Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic
We propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis.
Our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning.
We also introduce an innovative knowledge module fusing Chain-of-Thought reasoning with Retrieval-Augmented Generation (RAG).
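Since the summary names LoRA for lightweight fine-tuning, here is a minimal self-contained sketch of a LoRA-wrapped linear layer; the rank and scaling values are arbitrary, and production code would typically use a library such as PEFT.

```python
# Minimal LoRA sketch: freeze the pretrained weight, train a low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the trainable low-rank update B @ A.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```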
arXiv Detail & Related papers (2025-09-14T08:53:06Z)
- DriveQA: Passing the Driving Knowledge Test
We present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios.
We show that state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios.
We also demonstrate that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
arXiv Detail & Related papers (2025-08-29T17:59:53Z)
- Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding
This paper introduces a multi-agent framework for comprehensive highway scene understanding.
A large generic vision-language model (VLM) is contextualized with domain knowledge to generate task-specific chain-of-thought prompts.
The framework simultaneously addresses weather classification, pavement wetness assessment, and traffic congestion detection.
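As a rough illustration of contextualizing a VLM with domain knowledge, the sketch below assembles task-specific chain-of-thought prompts for the three tasks named above; the cue lists and wording are assumptions, not the paper's actual prompts.

```python
# Illustrative task-specific CoT prompt construction for highway analysis.
TASK_KNOWLEDGE = {
    "weather": "Cues: sky brightness, precipitation streaks, visibility range.",
    "wetness": "Cues: specular reflections, lane-marking contrast, tire spray.",
    "congestion": "Cues: vehicle density, inter-vehicle gaps, stopped traffic.",
}

def build_cot_prompt(task: str) -> str:
    """Compose a chain-of-thought prompt for one highway-analysis task."""
    return (
        f"You are analyzing a highway camera image for {task}.\n"
        f"{TASK_KNOWLEDGE[task]}\n"
        "Reason step by step about each cue before answering, "
        "then give a single final label."
    )

for task in TASK_KNOWLEDGE:
    print(build_cot_prompt(task), end="\n\n")
```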
arXiv Detail & Related papers (2025-08-24T03:55:24Z)
- Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
We propose the Hints of Prompt (HoP) framework, which introduces three types of hints as key enhancements.
These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data.
Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
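The summary does not describe the Hint Fusion module's architecture; the sketch below assumes one simple form, in which hint embeddings are appended as extra tokens and fused with the visual features through self-attention.

```python
# Assumed hint-fusion form: append hint tokens, attend, keep visual slots.
import torch
import torch.nn as nn

class HintFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, hint_tokens):
        # Concatenate hints so every visual token can attend to them.
        x = torch.cat([vis_tokens, hint_tokens], dim=1)
        out, _ = self.attn(x, x, x)
        out = self.norm(x + out)
        return out[:, : vis_tokens.size(1)]  # enriched visual tokens only

fuser = HintFusion()
print(fuser(torch.randn(2, 196, 512), torch.randn(2, 8, 512)).shape)
# torch.Size([2, 196, 512])
```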
arXiv Detail & Related papers (2024-11-20T06:58:33Z)
- Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard deep learning framework.
These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents.
Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
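Below is a minimal sketch of how auxiliary supervision of this kind is typically combined with the main objective; the task heads and loss weights are placeholders, not the paper's formulation.

```python
# Placeholder multi-task loss: main objective plus weighted auxiliary terms.
import torch

def total_loss(main_loss, aux_losses, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the main objective and auxiliary supervision terms."""
    aux = sum(w * l for w, l in zip(weights, aux_losses))
    return main_loss + aux

main = torch.tensor(1.2)                               # e.g., navigation loss
aux = [torch.tensor(v) for v in (0.4, 0.7, 0.3)]       # e.g., internal state,
print(total_loss(main, aux))                           # interactivity heads
# tensor(1.6700)
```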
arXiv Detail & Related papers (2023-11-27T18:57:42Z)
- SEPT: Towards Efficient Scene Representation Learning for Motion Prediction
This paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful models for complex traffic scenes.
Experiments demonstrate that SEPT, without elaborate architectural design or feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks.
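SEPT's pretext tasks are not spelled out in this summary; the sketch below illustrates one common self-supervised objective for traffic scenes, masked trajectory reconstruction, using assumed shapes and a toy encoder.

```python
# Assumed pretext task: hide random waypoints, train to reconstruct them.
import torch
import torch.nn as nn

def masked_reconstruction_loss(encoder, traj, mask_ratio=0.3):
    # traj: (B, T, 2) sequences of (x, y) waypoints
    B, T, _ = traj.shape
    mask = torch.rand(B, T) < mask_ratio           # True where input is hidden
    corrupted = traj.clone()
    corrupted[mask] = 0.0                          # zero out masked waypoints
    pred = encoder(corrupted)                      # (B, T, 2) reconstruction
    return nn.functional.mse_loss(pred[mask], traj[mask])

encoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
print(masked_reconstruction_loss(encoder, torch.randn(8, 50, 2)))
```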
arXiv Detail & Related papers (2023-09-26T21:56:03Z)
- A Study of Situational Reasoning for Traffic Understanding
We devise three novel text-based tasks for situational reasoning in the traffic domain.
We adopt four knowledge-enhanced methods that have shown generalization capability across language reasoning tasks in prior work.
We provide in-depth analyses of model performance on data partitions and examine model predictions categorically.
arXiv Detail & Related papers (2023-06-05T01:01:12Z)
- OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping
We present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure.
OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes.
We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.
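As an illustration of what topology-reasoning annotations can look like, the sketch below models lanes as graph nodes with directed successor edges and links traffic elements to the lanes they govern; the field names are illustrative, not the dataset's actual schema.

```python
# Illustrative lane-topology representation: lanes as nodes, directed
# successor edges, and traffic-element-to-lane correspondence.
from dataclasses import dataclass, field

@dataclass
class Lane:
    lane_id: int
    centerline: list          # ordered (x, y, z) points
    successors: list = field(default_factory=list)  # lane_ids reachable next

@dataclass
class TrafficElement:
    elem_id: int
    category: str             # e.g., "traffic_light", "speed_sign"
    controls: list = field(default_factory=list)    # lane_ids it applies to

scene = {
    "lanes": [Lane(0, [(0, 0, 0), (10, 0, 0)], successors=[1]),
              Lane(1, [(10, 0, 0), (20, 2, 0)])],
    "elements": [TrafficElement(7, "traffic_light", controls=[0])],
}
print(scene["lanes"][0].successors)  # [1]
```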
arXiv Detail & Related papers (2023-04-20T16:31:22Z)
- Utilizing Background Knowledge for Robust Reasoning over Traffic Situations
We focus on a complementary research aspect of Intelligent Transportation: traffic understanding.
Given the abundance of commonsense knowledge in text, we scope our study to text-based methods and datasets.
We adopt three knowledge-driven approaches for zero-shot QA over traffic situations.
arXiv Detail & Related papers (2022-12-04T09:17:24Z)
- TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
We create a novel dataset, TrafficQA (Traffic Question Answering), comprising 10,080 collected in-the-wild videos and 62,535 annotated QA pairs.
We propose 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events.
We also propose Eclipse, an efficient glimpse network with dynamic inference, to achieve computation-efficient and reliable video reasoning.
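One plausible reading of a glimpse network with dynamic inference is that a cheap scorer selects a few informative frames for the expensive backbone; the sketch below implements that selection step under assumed feature shapes, and is not the paper's actual architecture.

```python
# Assumed glimpse selection: rank frames cheaply, keep only the top-k.
import torch
import torch.nn as nn

class GlimpseSelector(nn.Module):
    def __init__(self, feat_dim=256, k=4):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)   # cheap per-frame relevance score
        self.k = k

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) lightweight per-frame features
        scores = self.scorer(frame_feats).squeeze(-1)        # (B, T)
        idx = scores.topk(self.k, dim=1).indices             # pick k glimpses
        batch = torch.arange(frame_feats.size(0)).unsqueeze(1)
        return frame_feats[batch, idx]                       # (B, k, D)

sel = GlimpseSelector()
glimpses = sel(torch.randn(2, 32, 256))
print(glimpses.shape)  # torch.Size([2, 4, 256]) -> fed to the heavy backbone
```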
arXiv Detail & Related papers (2021-03-29T12:12:50Z)
- What-If Motion Prediction for Autonomous Driving
Viable solutions must account for both the static geometric context, such as road lanes, and dynamic social interactions arising from multiple actors.
We propose a recurrent graph-based attentional approach with interpretable geometric (actor-lane) and social (actor-actor) relationships.
Our model can produce diverse predictions conditioned on hypothetical or "what-if" road lanes and multi-actor interactions.
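The "what-if" idea can be illustrated by querying one trained predictor with both the real and a hypothetical lane set; the stub model below stands in for the paper's recurrent graph-attention architecture, and all shapes are assumptions.

```python
# Counterfactual querying: same predictor, real vs. hypothetical lanes.
import torch
import torch.nn as nn

class StubPredictor(nn.Module):
    def __init__(self, dim=32, horizon=12):
        super().__init__()
        self.actor_enc = nn.Linear(2, dim)
        self.lane_enc = nn.Linear(2, dim)
        self.head = nn.Linear(2 * dim, horizon * 2)

    def forward(self, history, lanes):
        # history: (B, T, 2) past positions; lanes: (B, L, 2) lane points
        a = self.actor_enc(history).mean(dim=1)   # pooled actor context
        m = self.lane_enc(lanes).mean(dim=1)      # pooled map context
        out = self.head(torch.cat([a, m], dim=-1))
        return out.view(history.size(0), -1, 2)   # (B, horizon, 2)

model = StubPredictor()
hist = torch.randn(1, 20, 2)
real_lanes = torch.randn(1, 40, 2)
what_if_lanes = torch.randn(1, 40, 2)   # hypothetical lane geometry
print(model(hist, real_lanes)[0, 0], model(hist, what_if_lanes)[0, 0])
```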
arXiv Detail & Related papers (2020-08-24T17:49:30Z)