Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering
- URL: http://arxiv.org/abs/2112.06197v1
- Date: Sun, 12 Dec 2021 10:35:19 GMT
- Title: Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering
- Authors: Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, Tat-Seng Chua
- Abstract summary: We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
- Score: 80.94367625007352
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video question answering requires models to understand and reason about both
complex video and language data to correctly derive answers. Existing efforts
focus on designing sophisticated cross-modal interactions to fuse the
information from two modalities, while encoding the video and question
holistically as frame and word sequences. Despite their success, these methods
essentially revolve around the sequential nature of video and question
contents, offering little insight into the question-answering process and
lacking interpretability. In this work, we argue
that while video is presented as a frame sequence, the visual elements (e.g.,
objects, actions, activities, and events) are not sequential but rather
hierarchical in semantic space. To align with the multi-granular essence of
linguistic concepts in language queries, we propose to model video as a
conditional graph hierarchy which weaves together visual facts of different
granularity in a level-wise manner, with the guidance of corresponding textual
cues. Despite its simplicity, our extensive experiments demonstrate the
superiority of this conditional hierarchical graph architecture, with clear
performance improvements over prior methods and better generalization
across different types of questions. Further analyses also confirm the
model's reliability, as it provides meaningful visual-textual evidence for the
predicted answers.
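The level-wise, question-conditioned aggregation described in the abstract can be sketched minimally as follows. This is an illustrative toy, not the paper's architecture: the level sizes, the chunk-based pooling, and the dot-product attention against a single question vector are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def question_conditioned_pool(nodes, question, out_size):
    """Pool `nodes` (n, d) into `out_size` coarser nodes, weighting each
    fine node by its attention score against the question vector (d,).
    Illustrative stand-in for one level of the conditional hierarchy."""
    # Attention of each visual node to the textual cue (softmax over nodes).
    scores = nodes @ question
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Each coarse node is a weighted mixture of a contiguous chunk of
    # fine nodes (a simplifying assumption; the paper learns the grouping).
    chunks = np.array_split(np.arange(len(nodes)), out_size)
    return np.stack([
        (weights[idx, None] * nodes[idx]).sum(0) / weights[idx].sum()
        for idx in chunks
    ])

d = 16
# Bottom level: object-level features; question embedding as textual cue.
objects = rng.normal(size=(32, d))
question = rng.normal(size=d)

# Weave levels bottom-up, e.g. objects -> actions -> events -> video.
level_sizes = [8, 4, 1]  # assumed granularities, not from the paper
level = objects
for size in level_sizes:
    level = question_conditioned_pool(level, question, size)

video_repr = level[0]  # one question-conditioned video-level vector, shape (16,)
```

The point of the sketch is the conditioning: every pooling level re-weights its nodes by relevance to the language query, so coarse-grained representations are built from the visual facts the question actually asks about.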
Related papers
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z) - RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z) - Rethinking Multi-Modal Alignment in Video Question Answering from
Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z) - Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z) - Adaptive Hierarchical Graph Reasoning with Semantic Coherence for
Video-and-Language Inference [81.50675020698662]
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.
We propose an adaptive hierarchical graph network that achieves in-depth understanding of the video over complex interactions.
We introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network from three hierarchies.
arXiv Detail & Related papers (2021-07-26T15:23:19Z) - Relation-aware Hierarchical Attention Framework for Video Question
Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are embedded by pre-trained models firstly to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features by hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z) - Bridge to Answer: Structure-aware Graph Interaction Network for Video
Question Answering [56.65656211928256]
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video.
We learn question conditioned visual graphs by exploiting the relation between video and question to enable each visual node using question-to-visual interactions.
Our method can learn the question conditioned visual representations attributed to appearance and motion that show powerful capability for video question answering.
arXiv Detail & Related papers (2021-04-29T03:02:37Z) - Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.