Traffic-Domain Video Question Answering with Automatic Captioning
- URL: http://arxiv.org/abs/2307.09636v1
- Date: Tue, 18 Jul 2023 20:56:41 GMT
- Title: Traffic-Domain Video Question Answering with Automatic Captioning
- Authors: Ehsan Qasemi, Jonathan M. Francis, Alessandro Oltramari
- Abstract summary: Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities.
We present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Question Answering (VidQA) exhibits remarkable potential in
facilitating advanced machine reasoning capabilities within the domains of
Intelligent Traffic Monitoring and Intelligent Transportation Systems.
Nevertheless, the integration of urban traffic scene knowledge into VidQA
systems has received limited attention in previous research endeavors. In this
work, we present a novel approach termed Traffic-domain Video Question
Answering with Automatic Captioning (TRIVIA), which serves as a
weak-supervision technique for infusing traffic-domain knowledge into large
video-language models. Empirical findings obtained from the SUTD-TrafficQA task
highlight the substantial enhancements achieved by TRIVIA, elevating the
accuracy of representative video-language models by a remarkable 6.5 points
(19.88%) compared to baseline settings. This pioneering methodology holds great
promise for driving advancements in the field, inspiring researchers and
practitioners alike to unlock the full potential of emerging video-language
models in traffic-related applications.
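The two figures quoted in the abstract (6.5 points and 19.88%) can be related by simple arithmetic. The sketch below assumes that 19.88% is the gain relative to the baseline accuracy. The abstract does not state the baseline itself, so the values computed here are only implied by that assumption and are not reported results. The function name `implied_baseline` is hypothetical.

```python
# Minimal sketch: relate an absolute gain (in points) to a relative gain (in percent).
# Assumption: relative gain = absolute gain / baseline accuracy.

def implied_baseline(absolute_gain: float, relative_gain_pct: float) -> float:
    """Return the baseline accuracy implied by an absolute and a relative gain."""
    return absolute_gain / (relative_gain_pct / 100.0)

absolute_gain = 6.5        # points, as quoted in the abstract
relative_gain_pct = 19.88  # percent, as quoted in the abstract

baseline = implied_baseline(absolute_gain, relative_gain_pct)
improved = baseline + absolute_gain

print(f"implied baseline accuracy: {baseline:.1f}")
print(f"implied improved accuracy: {improved:.1f}")
```

Under this assumption the baseline would be roughly 32.7 and the improved accuracy roughly 39.2; the paper itself should be consulted for the actual per-model numbers.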
Related papers
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pair video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving [12.004604110512421]
Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving.
We propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them.
arXiv Detail & Related papers (2024-08-29T15:52:56Z)
- TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning [0.0]
We present TrafficVLM, a novel multi-modal dense video captioning model for vehicle ego camera view.
Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings.
arXiv Detail & Related papers (2024-04-14T14:51:44Z)
- TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models [10.904594811905778]
TrafficGPT is a fusion of ChatGPT and traffic foundation models.
By seamlessly intertwining large language model and traffic expertise, TrafficGPT offers a novel approach to leveraging AI capabilities in this domain.
arXiv Detail & Related papers (2023-09-13T04:47:43Z)
- A Study of Situational Reasoning for Traffic Understanding [63.45021731775964]
We devise three novel text-based tasks for situational reasoning in the traffic domain.
We adopt four knowledge-enhanced methods that have shown generalization capability across language reasoning tasks in prior work.
We provide in-depth analyses of model performance on data partitions and examine model predictions categorically.
arXiv Detail & Related papers (2023-06-05T01:01:12Z)
- TAU: A Framework for Video-Based Traffic Analytics Leveraging Artificial Intelligence and Unmanned Aerial Systems [2.748428882236308]
We develop an AI-integrated video analytics framework, called TAU (Traffic Analysis from UAVs), for automated traffic analytics and understanding.
Unlike previous works on traffic video analytics, we propose an automated object detection and tracking pipeline from video processing to advanced traffic understanding using high-resolution UAV images.
arXiv Detail & Related papers (2023-03-01T09:03:44Z)
- Utilizing Background Knowledge for Robust Reasoning over Traffic Situations [63.45021731775964]
We focus on a complementary research aspect of Intelligent Transportation: traffic understanding.
We scope our study to text-based methods and datasets given the abundant commonsense knowledge.
We adopt three knowledge-driven approaches for zero-shot QA over traffic situations.
arXiv Detail & Related papers (2022-12-04T09:17:24Z)
- Intelligent Traffic Monitoring with Hybrid AI [78.65479854534858]
We introduce HANS, a neuro-symbolic architecture for multi-modal context understanding.
We show how HANS addresses the challenges associated with traffic monitoring while being able to integrate with a wide range of reasoning methods.
arXiv Detail & Related papers (2022-08-31T17:47:22Z)
- TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events [13.46045177335564]
We create a novel dataset, TrafficQA (Traffic Question Answering), based on the collected 10,080 in-the-wild videos and annotated 62,535 QA pairs.
We propose 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events.
We also propose Eclipse, a novel Efficient glimpse network via dynamic inference, in order to achieve computation-efficient and reliable video reasoning.
arXiv Detail & Related papers (2021-03-29T12:12:50Z)