Related papers: Traffic-Domain Video Question Answering with Automatic Captioning

Traffic-Domain Video Question Answering with Automatic Captioning

URL: http://arxiv.org/abs/2307.09636v1
Date: Tue, 18 Jul 2023 20:56:41 GMT
Title: Traffic-Domain Video Question Answering with Automatic Captioning
Authors: Ehsan Qasemi, Jonathan M. Francis, Alessandro Oltramari
Abstract summary: Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities. We present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models.
Score: 69.98381847388553
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities within the domains of Intelligent Traffic Monitoring and Intelligent Transportation Systems. Nevertheless, the integration of urban traffic scene knowledge into VidQA systems has received limited attention in previous research endeavors. In this work, we present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models. Empirical findings obtained from the SUTD-TrafficQA task highlight the substantial enhancements achieved by TRIVIA, elevating the accuracy of representative video-language models by a remarkable 6.5 points (19.88%) compared to baseline settings. This pioneering methodology holds great promise for driving advancements in the field, inspiring researchers and practitioners alike to unlock the full potential of emerging video-language models in traffic-related applications.

Related papers

InterAct-Video: Reasoning-Rich Video QA for Urban Traffic [20.537672896807063]
Deep learning has advanced video-based traffic monitoring through question answering (VideoQA) models. Existing VideoQA models struggle with the complexity of real-world traffic scenes. InterAct VideoQA is a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks.
arXiv Detail & Related papers (2025-07-19T20:30:43Z)
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis [6.213279061986497]
SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach. Our framework employs a multimodal-based aggregation strategy to handle videos of various lengths and generate structured responses for review and evaluation. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
arXiv Detail & Related papers (2025-01-17T23:35:34Z)
Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks [0.0]
This study evaluates state-of-the-art VideoQA models using non-benchmark synthetic and real-world traffic sequences. VideoLLaMA-2 advances with 57% accuracy, particularly in compositional reasoning and consistent answers. These findings underscore VideoQA's potential in traffic monitoring but also emphasize the need for improvements in multi-object tracking, temporal reasoning, and compositional capabilities.
arXiv Detail & Related papers (2024-12-02T05:15:32Z)
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts. We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pair video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving [12.004604110512421]
Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. We propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them.
arXiv Detail & Related papers (2024-08-29T15:52:56Z)
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning [0.0]
We present TrafficVLM, a novel multi-modal dense video captioning model for vehicle ego camera view. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings.
arXiv Detail & Related papers (2024-04-14T14:51:44Z)
TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models [10.904594811905778]
TrafficGPT is a fusion of ChatGPT and traffic foundation models. By seamlessly intertwining large language model and traffic expertise, TrafficGPT offers a novel approach to leveraging AI capabilities in this domain.
arXiv Detail & Related papers (2023-09-13T04:47:43Z)
A Study of Situational Reasoning for Traffic Understanding [63.45021731775964]
We devise three novel text-based tasks for situational reasoning in the traffic domain. We adopt four knowledge-enhanced methods that have shown generalization capability across language reasoning tasks in prior work. We provide in-depth analyses of model performance on data partitions and examine model predictions categorically.
arXiv Detail & Related papers (2023-06-05T01:01:12Z)
TAU: A Framework for Video-Based Traffic Analytics Leveraging Artificial Intelligence and Unmanned Aerial Systems [2.748428882236308]
We develop an AI-integrated video analytics framework, called TAU (Traffic Analysis from UAVs), for automated traffic analytics and understanding. Unlike previous works on traffic video analytics, we propose an automated object detection and tracking pipeline from video processing to advanced traffic understanding using high-resolution UAV images.
arXiv Detail & Related papers (2023-03-01T09:03:44Z)
Utilizing Background Knowledge for Robust Reasoning over Traffic Situations [63.45021731775964]
We focus on a complementary research aspect of Intelligent Transportation: traffic understanding. We scope our study to text-based methods and datasets given the abundant commonsense knowledge. We adopt three knowledge-driven approaches for zero-shot QA over traffic situations.
arXiv Detail & Related papers (2022-12-04T09:17:24Z)
Intelligent Traffic Monitoring with Hybrid AI [78.65479854534858]
We introduce HANS, a neuro-symbolic architecture for multi-modal context understanding. We show how HANS addresses the challenges associated with traffic monitoring while being able to integrate with a wide range of reasoning methods.
arXiv Detail & Related papers (2022-08-31T17:47:22Z)
TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events [13.46045177335564]
We create a novel dataset, TrafficQA (Traffic Question Answering), based on the collected 10,080 in-the-wild videos and annotated 62,535 QA pairs. We propose 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events. We also propose Eclipse, a novel Efficient glimpse network via dynamic inference, in order to achieve computation-efficient and reliable video reasoning.
arXiv Detail & Related papers (2021-03-29T12:12:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.