Dense but Efficient VideoQA for Intricate Compositional Reasoning
- URL: http://arxiv.org/abs/2210.10300v1
- Date: Wed, 19 Oct 2022 05:01:20 GMT
- Title: Dense but Efficient VideoQA for Intricate Compositional Reasoning
- Authors: Jihyeon Lee, Wooyoung Kang, Eun-Sol Kim
- Abstract summary: We suggest a new VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex compositional VideoQA tasks.
The dependency structure within complex question sentences is also combined with the language embeddings so that the semantic relations among question words are readily understood.
- Score: 9.514382838449928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is well known that most conventional video question answering
(VideoQA) datasets consist of easy questions requiring simple reasoning
processes. However, long videos inevitably contain complex and compositional
semantic structures along the spatio-temporal axis, which requires a model to
understand the compositional structures inherent in the videos. In this paper,
we suggest a new compositional VideoQA method based on a transformer
architecture with a deformable attention mechanism to address complex VideoQA
tasks. Deformable attention is introduced to sample a subset of informative
visual features from the dense visual feature map, covering a temporally long
range of frames efficiently. Furthermore, the dependency structure within
complex question sentences is combined with the language embeddings so that
the semantic relations among question words are readily understood. Extensive
experiments and ablation studies show that the suggested dense but efficient
model outperforms other baselines.
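To make the sampling idea concrete, the following is a minimal PyTorch sketch of deformable attention over a dense spatio-temporal feature map. The module name, the number of sampled points, and the use of `grid_sample` for trilinear interpolation are illustrative assumptions on our part, not the authors' implementation.

```python
# Minimal sketch of deformable attention over a dense video feature map.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableVideoAttention(nn.Module):
    """Each query predicts a handful of (x, y, t) offsets around a reference
    point and attends only to features sampled at those locations, instead of
    attending densely over the whole T x H x W feature map."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 3)  # (dx, dy, dt) per point
        self.weight_proj = nn.Linear(dim, num_points)      # one weight per sampled point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, feat_map, ref_points):
        # queries:    (B, Q, C)        question-conditioned query tokens
        # feat_map:   (B, C, T, H, W)  dense spatio-temporal visual features
        # ref_points: (B, Q, 3)        reference (x, y, t) locations in [-1, 1]
        B, Q, _ = queries.shape
        offsets = self.offset_proj(queries).view(B, Q, self.num_points, 3)
        weights = self.weight_proj(queries).softmax(dim=-1)          # (B, Q, P)
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1.0, 1.0)  # (B, Q, P, 3)

        # Trilinear sampling of P points per query from the dense map;
        # grid_sample expects coords ordered (x, y, t) for a (B, C, T, H, W) input.
        grid = locs.view(B, Q, self.num_points, 1, 3)
        sampled = F.grid_sample(feat_map, grid, align_corners=True)  # (B, C, Q, P, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 3, 1)            # (B, Q, P, C)

        # Weighted sum over the sampled points, then project back to model dim.
        out = (weights.unsqueeze(-1) * self.value_proj(sampled)).sum(dim=2)
        return self.out_proj(out)
```

Because each query attends to only `num_points` sampled locations, the cost grows with the number of queries rather than with T x H x W, which is what allows a dense feature map over a temporally long range of frames to remain affordable.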
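Similarly, here is a hedged sketch of how a dependency parse of the question could be fused with the word embeddings. The spaCy model choice and the single graph-convolution-style mixing layer are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: fusing a dependency parse with token embeddings (illustrative only).
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")  # assumes this small English model is installed

def dependency_adjacency(question: str) -> torch.Tensor:
    """Build a symmetric token adjacency matrix (with self-loops) from the
    dependency tree of the question."""
    doc = nlp(question)
    adj = torch.eye(len(doc))
    for tok in doc:
        adj[tok.i, tok.head.i] = 1.0
        adj[tok.head.i, tok.i] = 1.0
    return adj

class DependencyAwareEncoder(nn.Module):
    """One graph-convolution-style step that lets each word embedding
    aggregate information from its syntactic neighbours."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_emb, adj):
        # token_emb: (N, C) word embeddings; adj: (N, N) dependency adjacency
        norm_adj = adj / adj.sum(dim=-1, keepdim=True)  # row-normalise
        return torch.relu(self.proj(norm_adj @ token_emb)) + token_emb
```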
Related papers
- RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z) - Discovering Spatio-Temporal Rationales for Video Question Answering [68.33688981540998]
This paper strives to solve complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times.
We propose a Spatio-Temporal Rationalization (STR) that adaptively collects question-critical moments and objects using cross-modal interaction.
We also propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism.
arXiv Detail & Related papers (2023-07-22T12:00:26Z) - Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z) - Relation-aware Hierarchical Attention Framework for Video Question
Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations among the inputs.
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z) - Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval [98.62404433761432]
The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems.
Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries.
We propose a Tree-augmented Cross-modal Encoding method that jointly learns the linguistic structure of queries and the temporal representation of videos.
arXiv Detail & Related papers (2020-07-06T02:50:27Z) - Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)