Adaptive Hierarchical Graph Reasoning with Semantic Coherence for
Video-and-Language Inference
- URL: http://arxiv.org/abs/2107.12270v1
- Date: Mon, 26 Jul 2021 15:23:19 GMT
- Title: Adaptive Hierarchical Graph Reasoning with Semantic Coherence for
Video-and-Language Inference
- Authors: Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang,
Fei Wu, Yi Yang, Yueting Zhuang
- Abstract summary: Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.
We propose an adaptive hierarchical graph network that achieves an in-depth understanding of the video and its complex interactions.
We introduce semantic coherence learning to explicitly encourage semantic coherence in the adaptive hierarchical graph network across three hierarchies.
- Score: 81.50675020698662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-and-Language Inference is a recently proposed task for joint
video-and-language understanding. This new task requires a model to draw
inference on whether a natural language statement entails or contradicts a
given video clip. In this paper, we study how to address three critical
challenges for this task: judging the global correctness of a statement that
involves multiple semantic meanings, reasoning jointly over video and
subtitles, and modeling long-range relationships and complex social
interactions. First, we propose an adaptive hierarchical graph network that
achieves an in-depth understanding of the video and its complex interactions.
Specifically, it performs joint reasoning over video and subtitles at three
hierarchies, where the graph structure is adaptively adjusted according to the
semantic structure of the statement. Second, we introduce semantic coherence
learning to explicitly encourage semantic coherence in the adaptive
hierarchical graph network across the three hierarchies. Semantic coherence
learning further improves the alignment between vision and language, as well
as the coherence across a sequence of video segments. Experimental results
show that our method outperforms the baseline by a large margin.
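To make the abstract's two ideas concrete, below is a minimal PyTorch sketch of (a) a graph-reasoning layer whose edge weights are adapted by the statement and (b) an InfoNCE-style coherence loss aligning video segments with text. All names, dimensions, and the single-level simplification (the paper reasons over three hierarchies) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of statement-adaptive graph reasoning + a coherence loss.
# Hypothetical module/loss names; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StatementAdaptiveGraphLayer(nn.Module):
    """One graph-reasoning level whose edges are conditioned on the statement.

    In the paper's setting, three such levels (e.g., frame, clip, segment)
    would be stacked; this sketch shows a single level for brevity.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # node queries
        self.k = nn.Linear(dim, dim)   # node keys
        self.v = nn.Linear(dim, dim)   # node values
        self.gate = nn.Linear(dim, 1)  # statement-conditioned edge gate

    def forward(self, nodes: torch.Tensor, stmt: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, D) video/subtitle nodes; stmt: (B, D) pooled statement.
        q, k, v = self.q(nodes), self.k(nodes), self.v(nodes)
        logits = q @ k.transpose(-1, -2) / nodes.size(-1) ** 0.5  # (B, N, N)
        # "Adaptively adjust" the graph: down-weight edges into nodes that
        # are irrelevant to the statement via a soft, differentiable gate.
        rel = torch.sigmoid(self.gate(nodes + stmt.unsqueeze(1)))  # (B, N, 1)
        adj = F.softmax(logits, dim=-1) * rel.transpose(-1, -2)
        return nodes + adj @ v  # residual message passing


def coherence_loss(video_seg: torch.Tensor, text_seg: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment: matched (video, text) pairs score highest."""
    v = F.normalize(video_seg, dim=-1)  # (B, D)
    t = F.normalize(text_seg, dim=-1)   # (B, D)
    sim = v @ t.t() / tau               # (B, B) similarity matrix
    return F.cross_entropy(sim, torch.arange(sim.size(0)))


if __name__ == "__main__":
    B, N, D = 2, 8, 64
    layer = StatementAdaptiveGraphLayer(D)
    nodes, stmt = torch.randn(B, N, D), torch.randn(B, D)
    out = layer(nodes, stmt)                  # (B, N, D)
    loss = coherence_loss(out.mean(1), stmt)  # scalar
    print(out.shape, loss.item())
```

The gate is one simple way to let the statement reshape the graph; the paper may instead predict discrete structure from the statement's parse, which this sketch does not attempt.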
Related papers
- HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take inspiration from human perception and explore a compositional approach to egocentric video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z)
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment in an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the structured semantics inherent in video and language is the crucial factor in achieving compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while a video is presented as a frame sequence, its visual elements are not sequential but hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy that weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper tackles the problem of temporal language localization in videos:
identifying the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary-learning-based method to learn relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model that decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.