Structured Two-stream Attention Network for Video Question Answering
- URL: http://arxiv.org/abs/2206.01017v1
- Date: Thu, 2 Jun 2022 12:25:52 GMT
- Title: Structured Two-stream Attention Network for Video Question Answering
- Authors: Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao
Mei, Heng Tao Shen
- Abstract summary: We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
- Score: 168.95603875458113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To date, visual question answering (VQA) (i.e., image QA and video QA) is
still a holy grail in vision and language understanding, especially for video
QA. Compared with image QA that focuses primarily on understanding the
associations between image region-level details and corresponding questions,
video QA requires a model to jointly reason across both spatial and long-range
temporal structures of a video as well as text to provide an accurate answer.
In this paper, we specifically tackle the problem of video QA by proposing a
Structured Two-stream Attention network, namely STA, to answer a free-form or
open-ended natural language question about the content of a given video. First,
we infer rich long-range temporal structures in videos using our structured
segment component and encode text features. Then, our structured two-stream
attention component simultaneously localizes important visual instances,
reduces the influence of background video, and focuses on the relevant text.
Finally,
the structured two-stream fusion component incorporates different segments of
query and video aware context representation and infers the answers.
Experiments on the large-scale video QA dataset TGIF-QA show that our proposed
method significantly surpasses the best counterpart (i.e., with one
representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 for the
Action, Trans., FrameQA and Count tasks. It also outperforms the best
competitor (i.e., with two representations) on the Action, Trans., and FrameQA
tasks by 4.1%, 4.7%, and 5.1%.
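
As a rough illustration of the pipeline described above (segment encoding, two
attention streams over video and text, then fusion), here is a minimal sketch.
It assumes plain dot-product attention, a mean-pooled question vector, and toy
dimensions; every function and variable name is hypothetical and not taken from
the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, items):
    """Dot-product attention: weight each item by its similarity to the query."""
    weights = softmax(items @ query)   # (num_items,)
    context = weights @ items          # (dim,)
    return context, weights

rng = np.random.default_rng(0)
dim = 64
num_segments = 8   # temporal segments from the video stream
num_words = 12     # tokens from the question stream

video_segments = rng.normal(size=(num_segments, dim))  # structured segment features
question_words = rng.normal(size=(num_words, dim))     # encoded text features
question_vec = question_words.mean(axis=0)             # crude sentence summary

# Two attention streams: one over video segments, one over question words.
visual_ctx, visual_w = attend(question_vec, video_segments)
text_ctx, text_w = attend(question_vec, question_words)

# Fuse the query- and video-aware contexts and score candidate answers.
fused = np.concatenate([visual_ctx, text_ctx])     # (2 * dim,)
answer_embeddings = rng.normal(size=(5, 2 * dim))  # 5 hypothetical candidates
scores = softmax(answer_embeddings @ fused)
print("answer distribution:", np.round(scores, 3))
```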
Related papers
- Capturing Co-existing Distortions in User-Generated Content for
No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two under-estimated, unresolved challenges in User-Generated Content (UGC) videos.
We propose Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z)
- Discovering Spatio-Temporal Rationales for Video Question Answering [68.33688981540998]
This paper strives to solve complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times.
We propose a Spatio-Temporal Rationalization (STR) that adaptively collects question-critical moments and objects using cross-modal interaction.
We also propose TranSTR, a Transformer-style neural network architecture that takes STR as its core and additionally underscores a novel answer interaction mechanism (a minimal sketch of the moment-selection idea follows this entry).
arXiv Detail & Related papers (2023-07-22T12:00:26Z)
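The adaptive collection of question-critical moments and objects mentioned in the entry above can be pictured as a scoring-and-selection step.
The sketch below is only an illustration, not the TranSTR model: it substitutes cosine similarity for the learned cross-modal interaction, uses a fixed top-k cutoff, and all names are hypothetical.

```python
import numpy as np

def topk_by_question(question_vec, candidate_feats, k):
    """Keep the k candidates (frames or objects) most similar to the question."""
    # Cosine similarity stands in for a learned cross-modal interaction module.
    q = question_vec / np.linalg.norm(question_vec)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    scores = c @ q
    keep = np.argsort(scores)[::-1][:k]   # indices of the k best-scoring candidates
    return np.sort(keep), scores

rng = np.random.default_rng(1)
dim = 32
question = rng.normal(size=dim)
frame_feats = rng.normal(size=(40, dim))    # one feature vector per sampled frame
object_feats = rng.normal(size=(15, dim))   # one feature vector per detected object

critical_frames, _ = topk_by_question(question, frame_feats, k=8)
critical_objects, _ = topk_by_question(question, object_feats, k=5)
print("question-critical frames:", critical_frames)
print("question-critical objects:", critical_objects)
```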
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while a video is presented as a frame sequence, its visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [67.85579756590478]
Video QA adds further layers of complexity, such as selecting the relevant content for each channel in the context of a linguistic query.
The Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates it into a new set of objects that encode relations among the inputs (a loose sketch of this idea follows this entry).
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information, such as subtitles, is also presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
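A loose reading of the CRN unit described in the entry above, relations computed over a set of inputs under a linguistic condition, can be sketched as follows.
This is a hedged illustration with random matrices standing in for learned weights and hypothetical names throughout; it is not the authors' CRN.

```python
import numpy as np

def relation_unit(inputs, condition, weight):
    """Map a set of input vectors to relation vectors, modulated by a condition.

    For every pair of inputs, concatenate the pair with the conditioning vector
    and project the result back to the feature dimension; the projection matrix
    is a random stand-in for a learned layer.
    """
    outputs = []
    n = len(inputs)
    for i in range(n):
        for j in range(i + 1, n):
            pair = np.concatenate([inputs[i], inputs[j], condition])  # (3 * dim,)
            outputs.append(np.tanh(weight @ pair))                    # (dim,)
    return np.stack(outputs)

rng = np.random.default_rng(2)
dim = 16
clip_feats = rng.normal(size=(4, dim))      # e.g. one vector per video clip
query_vec = rng.normal(size=dim)            # linguistic conditioning vector
W = rng.normal(size=(dim, 3 * dim)) * 0.1   # stand-in for learned parameters

relations = relation_unit(clip_feats, query_vec, W)
print(relations.shape)   # (6, 16): one relation vector per clip pair
```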
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information forward (a simplified gating sketch follows this entry).
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
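Finally, the frame-selection gating named in the last entry can be pictured in a simplified form: each frame receives a gate value from its interaction with the question, and strongly gated frames dominate the pooled video representation.
This is a minimal sketch with hypothetical names and no learned parameters, not the model from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_frame_pool(frame_feats, question_vec):
    """Pool frame features, letting question-relevant frames pass through the gate."""
    gates = sigmoid(frame_feats @ question_vec)          # (num_frames,) values in (0, 1)
    pooled = (gates[:, None] * frame_feats).sum(axis=0)  # gated sum over frames
    return pooled, gates

rng = np.random.default_rng(3)
dim = 24
frames = rng.normal(size=(10, dim))     # one feature vector per frame
question = rng.normal(size=dim) * 0.2   # encoded question

video_repr, gates = gated_frame_pool(frames, question)
print("gate values:", np.round(gates, 2))
print("pooled representation shape:", video_repr.shape)
```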