Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
- URL: http://arxiv.org/abs/2511.13397v1
- Date: Mon, 17 Nov 2025 14:12:22 GMT
- Title: Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
- Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
- Abstract summary: Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed to evaluate the perception systems of Vision-Language Models (VLMs) in traffic scenarios, in isolation from other skills, using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes.
- Score: 0.7644902597398215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often far away, we require systems with strong perception capabilities not only at close range (up to 20 meters) but also at long range (30+ meters). Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
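To make the per-sample structure concrete, below is a minimal Python sketch of how a DTPQA-style record could be represented and how accuracy could be broken down by object distance. This is an illustrative assumption, not the dataset's actual schema or the authors' generation scripts: the field names, file names, bin edges, and the DTPQASample/accuracy_by_distance helpers are all hypothetical. The close-range versus long-range split simply mirrors the abstract's distinction between objects up to 20 meters and objects at 30+ meters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DTPQASample:
    image_path: str    # path to the traffic-scene image
    question: str      # perception question about the scene
    answer: str        # ground-truth answer
    distance_m: float  # distance of the queried object from the camera, in meters


def accuracy_by_distance(samples, predictions, bin_edges=(0, 10, 20, 30)):
    """Group samples into distance bins and report per-bin accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        # Assign the sample to the first bin whose upper edge exceeds its distance;
        # anything at or beyond the last edge falls into an open-ended "30+" bin.
        upper = next((e for e in bin_edges[1:] if sample.distance_m < e), None)
        label = f"<{upper} m" if upper is not None else f"{bin_edges[-1]}+ m"
        total[label] += 1
        correct[label] += int(pred.strip().lower() == sample.answer.strip().lower())
    return {label: correct[label] / total[label] for label in sorted(total)}


if __name__ == "__main__":
    # Two toy samples: one close-range, one long-range (30+ meters).
    samples = [
        DTPQASample("scene_001.png", "Is the traffic light green?", "yes", 12.5),
        DTPQASample("scene_002.png", "Is there a pedestrian on the crosswalk?", "no", 35.0),
    ]
    predictions = ["yes", "yes"]
    print(accuracy_by_distance(samples, predictions))  # e.g. {'30+ m': 0.0, '<20 m': 1.0}
```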
Related papers
- Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception [0.7644902597398215]
We introduce the Distance-Annotated Traffic Perception Question Answering (DTPQA) benchmark, the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes. We evaluate several state-of-the-art (SOTA) small Vision-Language Models (VLMs) on DTPQA.
arXiv Detail & Related papers (2025-10-09T15:38:41Z)
- DriveQA: Passing the Driving Knowledge Test [13.569275971952154]
We present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. We show that state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios. We also demonstrate that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
arXiv Detail & Related papers (2025-08-29T17:59:53Z)
- Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving [27.39309272688527]
Interpretable communication is essential for safe and trustworthy autonomous driving. Current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Box-QAymo is a box-referring dataset and benchmark designed to evaluate robustness and finetune VLMs on spatial and temporal reasoning over user-specified objects.
arXiv Detail & Related papers (2025-07-01T07:40:16Z)
- STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving [16.602141801221364]
STSBench is a framework to benchmark holistic understanding of vision-language models (VLMs) for autonomous driving. The benchmark features 43 diverse scenarios spanning multiple views, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments.
arXiv Detail & Related papers (2025-06-06T16:25:22Z)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
- Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time.
arXiv Detail & Related papers (2025-03-06T07:36:06Z)
- Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks [0.0]
This study evaluates state-of-the-art VideoQA models using non-benchmark synthetic and real-world traffic sequences. VideoLLaMA-2 leads with 57% accuracy, particularly in compositional reasoning and consistent answers. These findings underscore VideoQA's potential in traffic monitoring but also emphasize the need for improvements in multi-object tracking, temporal reasoning, and compositional capabilities.
arXiv Detail & Related papers (2024-12-02T05:15:32Z)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z)
- DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems. We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- Utilizing Background Knowledge for Robust Reasoning over Traffic Situations [63.45021731775964]
We focus on a complementary research aspect of Intelligent Transportation: traffic understanding.
We scope our study to text-based methods and datasets given the abundant commonsense knowledge.
We adopt three knowledge-driven approaches for zero-shot QA over traffic situations.
arXiv Detail & Related papers (2022-12-04T09:17:24Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.