Locate before Answering: Answer Guided Question Localization for Video
Question Answering
- URL: http://arxiv.org/abs/2210.02081v2
- Date: Thu, 12 Oct 2023 09:00:34 GMT
- Title: Locate before Answering: Answer Guided Question Localization for Video
Question Answering
- Authors: Tianwen Qian, Ran Cui, Jingjing Chen, Pai Peng, Xiaowei Guo, and
Yu-Gang Jiang
- Abstract summary: LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
- Score: 70.38700123685143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video question answering (VideoQA) is an essential task in vision-language
understanding that has attracted considerable research attention recently.
Nevertheless, existing works mostly achieve promising performance on short
videos of less than 15 seconds. For VideoQA on minute-level long-term videos,
those methods are likely to fail because they lack the ability to deal with the
noise and redundancy caused by scene changes and multiple actions in the video.
Considering that the question is often concentrated in a short temporal range
of the video, we propose to first localize the question to a segment in the
video and then infer the answer using only the located segment. Under this
scheme, we propose "Locate before Answering" (LocAns), a novel approach that
integrates a question locator and an answer predictor into an end-to-end model.
During the training phase, the available answer label not only serves as the
supervision signal for the answer predictor but is also used to generate pseudo
temporal labels for the question locator. Moreover, we design a decoupled
alternating training strategy to update the two modules separately. In the
experiments, LocAns achieves state-of-the-art performance on two modern
long-term VideoQA datasets, NExT-QA and ActivityNet-QA, and qualitative
examples show that its question localization is reliable.
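To make the pipeline concrete, the following PyTorch-style sketch reconstructs the locate-then-answer scheme from the abstract alone. The module definitions, feature dimensions, and the rule used to derive pseudo temporal labels from the answer label are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the "Locate before Answering" scheme: a question locator
# scores candidate segments, an answer predictor reads only the chosen segment,
# and answer labels provide pseudo temporal labels. Module sizes and the
# pseudo-labeling rule are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, NUM_ANSWERS = 256, 1000  # hypothetical feature dim and answer vocabulary size


class QuestionLocator(nn.Module):
    """Scores each video segment against the question."""
    def __init__(self):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, 1))

    def forward(self, seg_feats, q_feat):            # (B, S, D), (B, D)
        q = q_feat.unsqueeze(1).expand_as(seg_feats)
        return self.score(torch.cat([seg_feats, q], dim=-1)).squeeze(-1)  # (B, S)


class AnswerPredictor(nn.Module):
    """Predicts the answer from the located segment plus the question."""
    def __init__(self):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, NUM_ANSWERS))

    def forward(self, seg_feat, q_feat):              # (B, D), (B, D)
        return self.cls(torch.cat([seg_feat, q_feat], dim=-1))  # (B, NUM_ANSWERS)


def pseudo_temporal_labels(predictor, seg_feats, q_feat, answer):
    """Pseudo label = segment whose features yield the highest probability
    for the ground-truth answer (a simple stand-in for the paper's rule)."""
    B, S, _ = seg_feats.shape
    with torch.no_grad():
        logits = predictor(seg_feats.reshape(B * S, -1),
                           q_feat.repeat_interleave(S, dim=0)).reshape(B, S, -1)
        probs = logits.softmax(-1).gather(-1, answer.view(B, 1, 1).expand(B, S, 1))
    return probs.squeeze(-1).argmax(dim=1)            # (B,)


locator, predictor = QuestionLocator(), AnswerPredictor()
opt_loc = torch.optim.Adam(locator.parameters(), lr=1e-4)
opt_ans = torch.optim.Adam(predictor.parameters(), lr=1e-4)

# Toy batch: 2 videos, 8 segments each, plus question features and answer ids.
seg_feats, q_feat = torch.randn(2, 8, D), torch.randn(2, D)
answer = torch.tensor([3, 7])

# Decoupled alternating updates: first the answer predictor on the segment the
# locator currently picks, then the locator on the pseudo temporal labels.
picked = locator(seg_feats, q_feat).argmax(dim=1)
picked_feat = seg_feats[torch.arange(2), picked]
loss_ans = F.cross_entropy(predictor(picked_feat, q_feat), answer)
opt_ans.zero_grad(); loss_ans.backward(); opt_ans.step()

pseudo = pseudo_temporal_labels(predictor, seg_feats, q_feat, answer)
loss_loc = F.cross_entropy(locator(seg_feats, q_feat), pseudo)
opt_loc.zero_grad(); loss_loc.backward(); opt_loc.step()
```

The two optimizer steps are intended to alternate across training iterations, mirroring the decoupled alternating strategy described above.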
Related papers
- STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results
for Video Question Answering [42.173245795917026]
We propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering.
STAIR is a neural module network that contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks.
We conduct extensive experiments on several video question answering datasets to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available.
arXiv Detail & Related papers (2024-01-08T14:01:59Z)
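As a rough illustration of the neural module network design summarized above, the sketch below composes two toy sub-task modules under a hand-written program; STAIR's actual program generator, module inventory, and feature shapes are not reproduced here and everything named below is an assumption for illustration only.

```python
# Toy illustration of the neural-module-network idea behind STAIR: a "program"
# (here hand-written rather than produced by a learned program generator)
# composes small sub-task modules over frame features. The module set and
# program format are illustrative assumptions.
import torch
import torch.nn as nn

D = 128  # hypothetical frame-feature dimension


class Filter(nn.Module):
    """Re-weights frames by their relevance to a query vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, D)

    def forward(self, frames, query):                    # (T, D), (D,)
        att = (self.proj(frames) @ query).softmax(dim=0)  # (T,)
        return frames * att.unsqueeze(-1)                  # attended copy, (T, D)


class Count(nn.Module):
    """Maps attended frames to a scalar count estimate."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(D, 1)

    def forward(self, frames, query):
        return self.head(frames.sum(dim=0))                # (1,)


MODULES = {"filter": Filter(), "count": Count()}


def execute(program, frames, query):
    """Runs modules in order; each intermediate output stays auditable."""
    out, trace = frames, []
    for name in program:
        out = MODULES[name](out, query)
        trace.append((name, out.detach()))  # inspectable intermediate result
    return out, trace


frames, query = torch.randn(16, D), torch.randn(D)
answer, trace = execute(["filter", "count"], frames, query)
print(answer.shape, [name for name, _ in trace])
```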
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
arXiv Detail & Related papers (2023-08-18T07:45:10Z)
- Discovering Spatio-Temporal Rationales for Video Question Answering [68.33688981540998]
This paper strives to solve complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times.
We propose a Spatio-Temporal Rationalization (STR) that adaptively collects question-critical moments and objects using cross-modal interaction.
We also propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism.
arXiv Detail & Related papers (2023-07-22T12:00:26Z)
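A minimal sketch of the rationalization step described above: cross-modal scores rank frames against the question and only the top-k are kept as the question-critical rationale. The scorer, dimensions, and value of k are illustrative assumptions rather than TranSTR's actual design.

```python
# Minimal sketch of spatio-temporal rationalization: cross-modal scores rank
# frames against the question and only the top-k "question-critical" frames are
# kept. Dimensions, scoring network, and k are illustrative assumptions.
import torch
import torch.nn as nn

D, K = 256, 4  # hypothetical feature dim and number of kept frames


class FrameRationalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(D, D)
        self.v_proj = nn.Linear(D, D)

    def forward(self, frame_feats, q_feat):                    # (T, D), (D,)
        scores = self.v_proj(frame_feats) @ self.q_proj(q_feat)  # (T,)
        top = scores.topk(K).indices.sort().values               # keep temporal order
        return frame_feats[top], scores                          # rationale, (K, D)


rationalizer = FrameRationalizer()
frame_feats, q_feat = torch.randn(32, D), torch.randn(D)
rationale, scores = rationalizer(frame_feats, q_feat)
print(rationale.shape)  # torch.Size([4, 256]) -> passed on to the answer decoder
```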
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
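The cascaded selection idea can be sketched as follows: segments are scored against the question first, and region-level attention is applied only inside the selected segments, avoiding dense spatial-temporal attention. The shapes and the hard top-k selection below are assumptions made for illustration, not MIST's exact modules.

```python
# Sketch of cascaded selection: attend over segments first, pick the most
# relevant ones, then attend over regions inside those segments, instead of
# dense attention over every region of every frame.
import torch
import torch.nn as nn

D, TOP_SEG = 128, 2  # hypothetical feature dim and number of kept segments


class CascadedSelector(nn.Module):
    def __init__(self):
        super().__init__()
        self.seg_score = nn.Linear(D, 1)
        self.reg_score = nn.Linear(D, 1)

    def forward(self, region_feats, q_feat):
        # region_feats: (S, R, D) = S segments, R regions per segment
        seg_feats = region_feats.mean(dim=1)                          # (S, D) segment summary
        seg_att = self.seg_score(seg_feats + q_feat).squeeze(-1)      # (S,)
        keep = seg_att.topk(TOP_SEG).indices                          # selected segments
        kept = region_feats[keep]                                     # (TOP_SEG, R, D)
        reg_att = self.reg_score(kept + q_feat).squeeze(-1).softmax(-1)  # (TOP_SEG, R)
        return (reg_att.unsqueeze(-1) * kept).sum(dim=1)              # (TOP_SEG, D)


selector = CascadedSelector()
region_feats, q_feat = torch.randn(8, 16, D), torch.randn(D)
print(selector(region_feats, q_feat).shape)  # torch.Size([2, 128])
```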
- Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering [73.11017833431313]
Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question.
We devise a weakly supervised question grounding (WSQG) setting, where only QA annotations are used.
We transform the correspondence between frames and subtitles to Frame-Subtitle (FS) self-supervision, which helps to optimize the temporal attention scores.
arXiv Detail & Related papers (2022-09-08T07:20:51Z)
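A compact sketch of how frame-subtitle correspondence can act as self-supervision for temporal attention: subtitle-question similarity yields a soft target that the model's frame attention is pushed toward, so no temporal boundary annotations are needed. The encoders, the similarity-based target, and the KL objective are assumptions, not the paper's exact formulation.

```python
# Sketch of frame-subtitle self-supervision: frame-aligned subtitle features
# produce a soft target distribution that the question-conditioned temporal
# attention is encouraged to match.
import torch
import torch.nn.functional as F

T, D = 20, 256
frame_feats = torch.randn(T, D)      # per-frame visual features
subtitle_feats = torch.randn(T, D)   # features of subtitles aligned to each frame
q_feat = torch.randn(D)

# Temporal attention logits of the QA model over frames (random stand-ins here).
temporal_logits = frame_feats @ q_feat                              # (T,)

# Self-supervised target: frames whose subtitles match the question get more mass.
with torch.no_grad():
    target = (subtitle_feats @ q_feat / D ** 0.5).softmax(dim=0)    # (T,)

fs_loss = F.kl_div(temporal_logits.log_softmax(dim=0), target, reduction="sum")
print(fs_loss.item())
```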
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are always practically available, it is crucial to generate question-answer pairs from a video via Video Question-Answer Generation (VQAG).
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between the textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
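A small sketch of the frame-wise matching idea: each frame receives a cross-modal relevance score for the query, and the predicted moment is the span with the highest average score. The scorer and the exhaustive span search are illustrative assumptions rather than the paper's boundary predictor.

```python
# Sketch of frame-wise cross-modal matching for moment retrieval: score each
# frame against the query, then pick the span whose mean relevance is highest.
import torch
import torch.nn as nn

D = 256  # hypothetical feature dimension


class FrameMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.frame_proj = nn.Linear(D, D)
        self.query_proj = nn.Linear(D, D)

    def forward(self, frame_feats, query_feat):      # (T, D), (D,)
        return torch.sigmoid(self.frame_proj(frame_feats) @ self.query_proj(query_feat))  # (T,)


def best_span(scores, min_len=2):
    """Exhaustively pick the span with the highest mean relevance."""
    T = scores.numel()
    best, bounds = -1.0, (0, min_len)
    for s in range(T):
        for e in range(s + min_len, T + 1):
            m = scores[s:e].mean().item()
            if m > best:
                best, bounds = m, (s, e)
    return bounds


matcher = FrameMatcher()
frame_feats, query_feat = torch.randn(30, D), torch.randn(D)
with torch.no_grad():
    start, end = best_span(matcher(frame_feats, query_feat))
print(start, end)  # predicted temporal boundary in frame indices
```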
This list is automatically generated from the titles and abstracts of the papers on this site.