Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
- URL: http://arxiv.org/abs/2410.00771v1
- Date: Tue, 1 Oct 2024 15:07:07 GMT
- Title: Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
- Authors: Chen Cai, Zheng Wang, Jianjun Gao, Wenyang Liu, Ye Lu, Runzhong Zhang, Kim-Hui Yap
- Abstract summary: This paper explores the novel challenge of VideoQA within a continual learning framework.
We propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting.
Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches.
- Score: 15.161997580529075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.
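The abstract names three prompt types that ColPro combines. A minimal sketch of how such prompts might be composed into a single LLM input follows; the templates, field names, and the `build_collaborative_prompt` helper are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: composing the three ColPro prompt types into one
# LLM input. Templates and argument names are hypothetical, not the paper's.

def build_collaborative_prompt(question: str,
                               task_knowledge: str,
                               temporal_summary: str) -> str:
    # Question constraint prompting: tie the answer space to the question.
    constraint = f"Answer only the question asked: {question}"
    # Knowledge acquisition prompting: carry over knowledge from earlier tasks
    # to mitigate catastrophic forgetting across the task sequence.
    knowledge = f"Relevant knowledge from previous tasks: {task_knowledge}"
    # Visual temporal awareness prompting: expose video dynamics as text.
    temporal = f"Temporal context of the video: {temporal_summary}"
    return "\n".join([constraint, knowledge, temporal])

prompt = build_collaborative_prompt(
    "What does the man do after opening the door?",
    "objects and actions seen in earlier QA tasks",
    "door opens at 00:03, man walks through at 00:05",
)
```

In the paper these components are learned prompts attached to a frozen LLM rather than literal text templates; the sketch only shows how the three signals are brought together in one input.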
Related papers
- Top-down Activity Representation Learning for Video Question Answering [4.236280446793381]
Capturing complex hierarchical human activities is crucial for achieving high-performance video question answering (VideoQA)
We convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task.
Our approach achieves competitive performance on the STAR task and, in particular, a 78.4% accuracy score on the NExT-QA task, exceeding the current state-of-the-art by 2.8 points.
arXiv Detail & Related papers (2024-09-12T04:43:27Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
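The GQE summary above describes enriching sparse text queries with clip-level captions. A hedged sketch of that expansion step, assuming the captions have already been produced by a zero-shot captioning model (the `expand_query` helper and its format are illustrative, not GQE's actual pipeline):

```python
# Illustrative sketch of the GQE idea: expand a short text query with
# zero-shot clip captions so the text side carries scene-level detail
# comparable to the video side. Captions are supplied as plain strings
# here; the real method obtains them from a captioning model per clip.

def expand_query(query: str, clip_captions: list[str]) -> list[str]:
    # Each clip caption yields an additional training-time query,
    # enriching the original sparse text with scene descriptions.
    return [query] + [f"{query}. Scene: {c}" for c in clip_captions]

captions = ["a chef slices onions", "the chef fries them in a pan"]
expanded = expand_query("cooking tutorial", captions)
```

The design point is data-centric: the model is unchanged, and only the training text is expanded to reduce the text-video information imbalance.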
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering [63.12469700986452]
We introduce the concept of uncertainty-aware curriculum learning (CL)
Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty.
In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments.
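The curriculum idea above, ordering samples from "sure" to "uncertain", can be sketched by ranking QA samples on the entropy of the model's answer distribution and scheduling low-entropy (confident) samples first. The probability vectors below are placeholders for real model outputs, and this ordering rule is an assumed simplification of the paper's curriculum:

```python
# Sketch of uncertainty-aware curriculum ordering: rank QA samples by the
# entropy of the model's predicted answer distribution, train easy-to-hard.
import math

def entropy(probs: list[float]) -> float:
    # Shannon entropy of an answer distribution; higher = more uncertain.
    return -sum(p * math.log(p) for p in probs if p > 0)

def curriculum_order(samples: list[tuple[str, list[float]]]) -> list[str]:
    # Low entropy means a confident ("sure") sample, scheduled first.
    return [q for q, p in sorted(samples, key=lambda s: entropy(s[1]))]

samples = [
    ("hard question", [0.3, 0.3, 0.2, 0.2]),    # near-uniform: uncertain
    ("easy question", [0.9, 0.05, 0.03, 0.02]), # peaked: confident
]
order = curriculum_order(samples)  # "easy question" scheduled first
```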
arXiv Detail & Related papers (2024-01-03T02:29:34Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by, among other measures, integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
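The data-level idea above, sampling frames by CLIP-score ranking instead of uniformly, can be sketched as a top-k selection over precomputed frame-text similarity scores. The scores below are placeholders for real CLIP outputs, and the `sample_frames` helper is an assumption, not VaQuitA's code:

```python
# Sketch of score-guided frame sampling: instead of picking frames at
# uniform intervals, keep the k frames whose (precomputed) CLIP similarity
# to the question text is highest, returned in temporal order.

def sample_frames(scores: list[float], k: int) -> list[int]:
    # Indices of the top-k scoring frames, sorted back into temporal order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

clip_scores = [0.12, 0.80, 0.33, 0.75, 0.20, 0.68]
keep = sample_frames(clip_scores, k=3)  # frames 1, 3 and 5
```

Restoring temporal order after selection matters: downstream video encoders generally assume frames arrive in sequence.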
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Locate before Answering: Answer Guided Question Localization for Video
Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z) - Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses to VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.