TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
- URL: http://arxiv.org/abs/2502.02449v1
- Date: Tue, 04 Feb 2025 16:14:40 GMT
- Title: TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
- Authors: Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll
- Abstract summary: We present TUMTraffic-VideoQA, a dataset and benchmark designed for understanding complex traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies.
- Score: 26.948071735495237
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks-multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding-within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity, highlight the limitations of existing models, and position TUMTraffic-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
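The abstract describes tuple-based spatio-temporal object expressions as the glue that unifies multiple-choice QA, referred object captioning, and grounding, but does not spell out the format here. The sketch below is a minimal illustration of how such annotations could be organized; the field names (object_id, category, frame_id, bbox) and the example values are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of a tuple-based spatio-temporal object expression and the
# three unified task formats named in the abstract (multiple-choice QA,
# referred object captioning, spatio-temporal grounding).
# Field names and values are illustrative assumptions, not the dataset's schema.
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectTuple:
    """One spatio-temporal observation of a referred object: <id, class, frame, box>."""
    object_id: str     # e.g. "veh_017" (hypothetical identifier)
    category: str      # e.g. "Car", "Truck", "Pedestrian"
    frame_id: int      # frame index within the video clip
    bbox: List[float]  # [x1, y1, x2, y2] in pixels


@dataclass
class MultipleChoiceQA:
    question: str
    options: List[str]
    answer_index: int  # index into `options`


@dataclass
class ReferredObjectCaption:
    reference: ObjectTuple  # which object the caption describes
    caption: str


@dataclass
class SpatioTemporalGrounding:
    expression: str                # natural-language referring expression
    trajectory: List[ObjectTuple]  # the object's tube across frames


if __name__ == "__main__":
    # Toy example of how one video could carry all three annotation types.
    obj = ObjectTuple("veh_017", "Car", frame_id=42, bbox=[512.0, 300.0, 640.0, 380.0])
    qa = MultipleChoiceQA(
        question="What type of vehicle enters the highway during the snowfall?",
        options=["Car", "Truck", "Bus", "Motorcycle"],
        answer_index=0,
    )
    cap = ReferredObjectCaption(obj, "A dark sedan merging onto the highway in heavy snow.")
    grounding = SpatioTemporalGrounding("the sedan that merges from the on-ramp", [obj])
    print(qa, cap, grounding, sep="\n")
```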
Related papers
- InterAct-Video: Reasoning-Rich Video QA for Urban Traffic [20.537672896807063]
Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models. Existing VideoQA models struggle with the complexity of real-world traffic scenes. InterAct VideoQA is a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks.
arXiv Detail & Related papers (2025-07-19T20:30:43Z) - Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought [19.792159494513424]
Video comprehension is essential for various applications ranging from video analysis to interactive systems. Despite advancements in vision-language models, these models often struggle to capture nuanced spatiotemporal details. To address this, we introduce Video-CoT, a groundbreaking dataset designed to enhance video understanding.
arXiv Detail & Related papers (2025-06-10T14:08:56Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained spatio-temporal pointing in video sequences. A novel temporal mask fusion mechanism employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos [26.97196583891564]
We introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion.
arXiv Detail & Related papers (2025-05-26T15:24:06Z) - SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos, specifically designed to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z) - Towards Fine-Grained Video Question Answering [17.582244704442747]
This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset.
With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding.
We present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding.
arXiv Detail & Related papers (2025-03-10T01:02:01Z) - VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z) - Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks [0.0]
This study evaluates state-of-the-art VideoQA models using non-benchmark synthetic and real-world traffic sequences. VideoLLaMA-2 leads with 57% accuracy, performing particularly well in compositional reasoning and answer consistency. These findings underscore VideoQA's potential in traffic monitoring but also emphasize the need for improvements in multi-object tracking, temporal reasoning, and compositional capabilities.
arXiv Detail & Related papers (2024-12-02T05:15:32Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z) - BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z) - Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object (a per-frame sequence of bounding boxes; a rough sketch follows this list).
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)
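Several of the grounding papers above (TubeDETR, STGRN) cast their task as localizing a spatio-temporal tube, i.e. a bounding box for the referred object in every frame where it appears. As a minimal sketch, assuming a tube is stored as a frame-indexed dictionary of boxes, the snippet below scores a predicted tube against ground truth with a vIoU-style overlap (mean per-frame IoU over the union of annotated frames); the function names and example boxes are illustrative, not taken from any of these papers.

```python
# Minimal sketch: representing a spatio-temporal tube and scoring it with a
# vIoU-style metric (mean per-frame IoU over the union of annotated frames).
# Illustrative approximation only, not the exact protocol of any paper above.
from typing import Dict, List

Box = List[float]  # [x1, y1, x2, y2]


def box_iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Average IoU over the union of frames covered by prediction and ground truth.

    Frames where only one tube has a box contribute 0, so both missed frames
    and spurious detections lower the score.
    """
    frames = set(pred) | set(gt)
    if not frames:
        return 0.0
    total = sum(box_iou(pred[f], gt[f]) for f in frames if f in pred and f in gt)
    return total / len(frames)


if __name__ == "__main__":
    gt = {10: [100, 100, 200, 200], 11: [105, 100, 205, 200]}
    pred = {10: [110, 110, 210, 210], 12: [300, 300, 400, 400]}
    print(f"vIoU = {tube_viou(pred, gt):.3f}")
```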