Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering
- URL: http://arxiv.org/abs/2209.03609v1
- Date: Thu, 8 Sep 2022 07:20:51 GMT
- Title: Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering
- Authors: Jiong Wang, Zhou Zhao, Weike Jin
- Abstract summary: Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question.
We devise a weakly supervised question grounding (WSQG) setting, where only QA annotations are used.
We transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps to optimize the temporal attention scores.
- Score: 73.11017833431313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal video question answering aims to predict the correct answer and
localize the temporal boundary relevant to the question. The temporal
annotations of questions improve the QA performance and interpretability of
recent works, but they are usually empirical and costly. To avoid the temporal
annotations, we devise a weakly supervised question grounding (WSQG) setting,
where only QA annotations are used and the relevant temporal boundaries are
generated according to the temporal attention scores. To substitute for the
temporal annotations, we transform the correspondence between frames and
subtitles into Frame-Subtitle (FS) self-supervision, which helps to optimize the
temporal attention scores and hence improves video-language understanding in
the VideoQA model. Extensive experiments on the TVQA and TVQA+ datasets
demonstrate that the proposed WSQG strategy achieves comparable performance on
question grounding, and that the FS self-supervision improves question answering
and grounding performance in both the QA-supervision-only and full-supervision
settings.
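The FS self-supervision hinges on the fact that subtitles carry timestamps, so the frames each subtitle spans are known for free. Below is a minimal sketch of one way such a signal could be turned into a training loss, assuming per-frame timestamps and subtitle time spans are available: subtitle-level attention is propagated onto the frames each subtitle overlaps, and the model's frame-level temporal attention is pulled toward that pseudo target. The tensor shapes, the overlap-based target, and the KL loss are illustrative choices, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact implementation): build a frame-level
# pseudo target for temporal attention from frame-subtitle time overlap, and pull
# the predicted frame attention toward it.
import torch
import torch.nn.functional as F

def frame_subtitle_alignment(frame_times, sub_spans):
    """A[i, j] = 1 if frame i falls inside subtitle j's time span.

    frame_times: (num_frames,) frame timestamps in seconds.
    sub_spans:   (num_subs, 2) subtitle [start, end] times in seconds.
    """
    t = frame_times[:, None]                         # (num_frames, 1)
    start, end = sub_spans[:, 0], sub_spans[:, 1]    # (num_subs,), (num_subs,)
    return ((t >= start) & (t <= end)).float()       # (num_frames, num_subs)

def fs_self_supervision_loss(frame_attn, sub_attn, frame_times, sub_spans, eps=1e-8):
    """KL divergence between the frame attention and a pseudo target that spreads
    the subtitle attention over the frames each subtitle overlaps.

    frame_attn: (num_frames,) softmax-normalized temporal attention over frames.
    sub_attn:   (num_subs,)   softmax-normalized temporal attention over subtitles.
    """
    align = frame_subtitle_alignment(frame_times, sub_spans)   # (num_frames, num_subs)
    target = align @ sub_attn                                   # (num_frames,)
    target = target / (target.sum() + eps)                      # renormalize to a distribution
    return F.kl_div((frame_attn + eps).log(), target, reduction="sum")

# Toy usage with random attention scores (10 frames at 1 fps, 3 subtitles).
frame_times = torch.arange(0.0, 10.0, 1.0)
sub_spans = torch.tensor([[0.0, 3.0], [3.0, 7.0], [7.0, 10.0]])
frame_attn = torch.softmax(torch.randn(10), dim=0)
sub_attn = torch.softmax(torch.randn(3), dim=0)
loss = fs_self_supervision_loss(frame_attn, sub_attn, frame_times, sub_spans)
```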
Related papers
- HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling.
We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding.
Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z)
- Towards Fine-Grained Video Question Answering [17.582244704442747]
This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset.
With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding.
We present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding.
arXiv Detail & Related papers (2025-03-10T01:02:01Z)
- TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions.
We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate two benchmark versions containing 2k and 10k QA pairs per category, respectively.
We assess the VideoQA model's temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos.
Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities.
We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z)
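As a rough illustration of the question-guided temporal queries described above, the sketch below uses a small set of learnable queries, biased by a question embedding, that cross-attend over frame features to produce a fixed number of tokens for an LLM. The module name, dimensions, and additive conditioning are assumptions made for the sketch, not the actual T-Former design.

```python
# Generic sketch of question-guided temporal queries (Q-Former-style):
# learnable queries, biased by the question embedding, cross-attend over
# frame features to produce a compact temporal summary for an LLM.
# Names, sizes, and the additive conditioning are assumptions, not T-Former.
import torch
import torch.nn as nn

class QuestionGuidedTemporalQueries(nn.Module):
    def __init__(self, dim=512, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.q_proj = nn.Linear(dim, dim)   # projects the question embedding
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, question_emb):
        """frame_feats: (B, T, D) per-frame features; question_emb: (B, D)."""
        B = frame_feats.size(0)
        # Condition the learnable queries on the question (simple additive bias).
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + self.q_proj(question_emb).unsqueeze(1)
        # Each query attends over all frames, yielding a fixed-size temporal summary.
        summary, _ = self.cross_attn(q, frame_feats, frame_feats)
        return self.norm(summary)           # (B, num_queries, D) tokens for the LLM

# Toy usage
m = QuestionGuidedTemporalQueries()
tokens = m(torch.randn(2, 64, 512), torch.randn(2, 512))
```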
- Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting [15.161997580529075]
This paper explores the novel challenge of VideoQA within a continual learning framework.
We propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting.
Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches.
arXiv Detail & Related papers (2024-10-01T15:07:07Z)
- Multi-hop Question Answering under Temporal Knowledge Editing [9.356343796845662]
Multi-hop question answering (MQA) under knowledge editing (KE) has garnered significant attention in the era of large language models.
Existing models for MQA under KE exhibit poor performance when dealing with questions containing explicit temporal contexts.
We propose TEMPoral knowLEdge augmented Multi-hop Question Answering (TEMPLE-MQA) to address this limitation.
arXiv Detail & Related papers (2024-03-30T23:22:51Z)
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs.
arXiv Detail & Related papers (2024-01-19T14:21:46Z)
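The sketch below illustrates, in generic terms, how a Gaussian over time can softly select question-critical frames without any temporal labels, which is the spirit suggested by the title and summary above; the predicted center/width parameterization and all names are assumptions, not the paper's implementation.

```python
# Illustrative sketch of Gaussian temporal weighting (based only on the
# title/summary wording, not the paper's actual design): a predicted center
# and width define a Gaussian over frame positions, softly selecting the
# question-critical moments without temporal supervision.
import torch
import torch.nn as nn

class GaussianFrameWeighting(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Linear(dim, 2)   # predicts (center, width) in [0, 1]

    def forward(self, frame_feats, question_emb):
        """frame_feats: (B, T, D); question_emb: (B, D) -> weighted frames (B, T, D)."""
        B, T, _ = frame_feats.shape
        center, width = torch.sigmoid(self.head(question_emb)).unbind(-1)   # (B,), (B,)
        width = width.clamp(min=0.05)                    # avoid degenerate spikes
        pos = torch.linspace(0, 1, T, device=frame_feats.device)            # (T,)
        # Gaussian weight for each frame position, per sample in the batch.
        w = torch.exp(-0.5 * ((pos[None, :] - center[:, None]) / width[:, None]) ** 2)
        w = w / w.sum(dim=1, keepdim=True)               # normalize over time
        return frame_feats * w.unsqueeze(-1)             # softly keep question-critical frames

# Toy usage
g = GaussianFrameWeighting()
weighted = g(torch.randn(2, 32, 512), torch.randn(2, 512))
```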
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
- Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
arXiv Detail & Related papers (2022-06-02T12:25:52Z)
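The following is a generic sketch of question-conditioned two-stream attention in the spirit of the STA summary above: one stream scores video segments and the other scores text tokens, both conditioned on the question, and the attended features are fused. It omits the structured segment component and uses simplified scoring, so it should be read as an illustration rather than the published model.

```python
# Generic two-stream, question-conditioned attention sketch (illustrative only;
# the real STA uses structured segments and a more elaborate fusion).
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.vis_score = nn.Linear(dim, 1)
        self.txt_score = nn.Linear(dim, 1)
        self.fuse = nn.Linear(2 * dim, dim)

    def attend(self, feats, question_emb, score_fn):
        # Score each element against the question, then take a softmax-weighted sum.
        scores = score_fn(feats * question_emb.unsqueeze(1))   # (B, N, 1)
        attn = torch.softmax(scores, dim=1)
        return (attn * feats).sum(dim=1)                       # (B, D)

    def forward(self, seg_feats, txt_feats, question_emb):
        """seg_feats: (B, Nv, D) video segments; txt_feats: (B, Nt, D); question_emb: (B, D)."""
        v = self.attend(seg_feats, question_emb, self.vis_score)
        t = self.attend(txt_feats, question_emb, self.txt_score)
        return self.fuse(torch.cat([v, t], dim=-1))            # joint representation for answering

# Toy usage
m = TwoStreamAttention()
out = m(torch.randn(2, 20, 512), torch.randn(2, 30, 512), torch.randn(2, 512))
```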
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [67.85579756590478]
Video QA adds at least two more layers of complexity, such as selecting relevant content for each channel in the context of a linguistic query.
The Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates it into a new set of objects that encode relations among the inputs.
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information, such as subtitles, is also presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
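To make the CRN description above more concrete, here is a rough sketch of a conditional relation unit: subsets of the input objects are aggregated and modulated by a conditioning feature (for example, the question), yielding a new set of relation-encoding objects. The subset enumeration, pooling, and conditioning below are simplifications chosen for the sketch, not the published architecture.

```python
# Rough sketch of a conditional relation unit in the spirit of the CRN
# description above (subset handling and conditioning are simplified):
# k-subsets of the input objects are aggregated and modulated by a
# conditioning feature, producing a new set of relation-encoding objects.
import itertools
import torch
import torch.nn as nn

class ConditionalRelationUnit(nn.Module):
    def __init__(self, dim=256, subset_size=2):
        super().__init__()
        self.subset_size = subset_size
        self.aggregate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.condition = nn.Linear(2 * dim, dim)

    def forward(self, objects, cond):
        """objects: (B, N, D) array of input objects; cond: (B, D) conditioning feature."""
        B, N, D = objects.shape
        outputs = []
        # Enumerate k-subsets of the input objects (kept small in practice).
        for idx in itertools.combinations(range(N), self.subset_size):
            pooled = objects[:, list(idx), :].mean(dim=1)           # (B, D) subset summary
            rel = self.aggregate(pooled)                            # relation encoding
            rel = self.condition(torch.cat([rel, cond], dim=-1))    # inject the condition
            outputs.append(rel)
        return torch.stack(outputs, dim=1)                          # (B, C(N, k), D) new objects

# Toy usage
unit = ConditionalRelationUnit()
new_objs = unit(torch.randn(2, 5, 256), torch.randn(2, 256))
```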