Variational Cross-Graph Reasoning and Adaptive Structured Semantics
Learning for Compositional Temporal Grounding
- URL: http://arxiv.org/abs/2301.09071v1
- Date: Sun, 22 Jan 2023 08:02:23 GMT
- Title: Variational Cross-Graph Reasoning and Adaptive Structured Semantics
Learning for Compositional Temporal Grounding
- Authors: Juncheng Li, Siliang Tang, Linchao Zhu, Wenqiao Zhang, Yi Yang,
Tat-Seng Chua, Fei Wu, Yueting Zhuang
- Abstract summary: Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the inherent structured semantics inside the videos and language is the crucial factor to achieve compositional generalization.
- Score: 143.5927158318524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal grounding is the task of locating a specific segment from an
untrimmed video according to a query sentence. This task has achieved
significant momentum in the computer vision community as it enables activity
grounding beyond pre-defined activity classes by utilizing the semantic
diversity of natural language descriptions. The semantic diversity is rooted in
the principle of compositionality in linguistics, where novel semantics can be
systematically described by combining known words in novel ways (compositional
generalization). However, existing temporal grounding datasets are not
carefully designed to evaluate the compositional generalizability. To
systematically benchmark the compositional generalizability of temporal
grounding models, we introduce a new Compositional Temporal Grounding task and
construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. When
evaluating the state-of-the-art methods on our new dataset splits, we
empirically find that they fail to generalize to queries with novel
combinations of seen words. We argue that the inherent structured semantics
inside the videos and language is the crucial factor to achieve compositional
generalization. Based on this insight, we propose a variational cross-graph
reasoning framework that explicitly decomposes video and language into
hierarchical semantic graphs, respectively, and learns fine-grained semantic
correspondence between the two graphs. Furthermore, we introduce a novel
adaptive structured semantics learning approach to derive the
structure-informed and domain-generalizable graph representations, which
facilitate the fine-grained semantic correspondence reasoning between the two
graphs. Extensive experiments validate the superior compositional
generalizability of our approach.
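As a rough illustration of what "novel combinations of seen words" means, the sketch below flags a test query as a novel composition when every word occurred in training but at least one word pair never co-occurred in a training query. This is not the procedure used to build Charades-CG or ActivityNet-CG, which the abstract does not describe; the helper names (`tokenize`, `build_vocab_and_pairs`, `is_novel_composition`), the pairwise co-occurrence criterion, and the toy queries are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(query):
    """Lowercase, split on whitespace, and strip simple punctuation (hypothetical helper)."""
    tokens = (w.strip(".,!?") for w in query.lower().split())
    return [t for t in tokens if t]


def build_vocab_and_pairs(train_queries):
    """Collect every word and every unordered word pair seen within a training query."""
    seen_words, seen_pairs = set(), set()
    for q in train_queries:
        words = set(tokenize(q))
        seen_words |= words
        seen_pairs |= {frozenset(p) for p in combinations(sorted(words), 2)}
    return seen_words, seen_pairs


def is_novel_composition(query, seen_words, seen_pairs):
    """True if all words are seen but at least one word pair is new (assumed criterion)."""
    words = set(tokenize(query))
    if not words <= seen_words:
        return False  # contains an unseen word, so it is not a pure recombination
    pairs = {frozenset(p) for p in combinations(sorted(words), 2)}
    return any(p not in seen_pairs for p in pairs)


# Toy training queries, invented for this example.
train = ["person opens the door", "person closes the window"]
seen_words, seen_pairs = build_vocab_and_pairs(train)

print(is_novel_composition("person opens the window", seen_words, seen_pairs))  # True
print(is_novel_composition("person opens the door", seen_words, seen_pairs))    # False
print(is_novel_composition("person opens the laptop", seen_words, seen_pairs))  # False
```

The third query is rejected because "laptop" never appears in training, which makes it a novel-word case rather than a novel combination of seen words.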
Related papers
- A Comprehensive Empirical Evaluation of Existing Word Embedding
Approaches [5.065947993017158]
We present the characteristics of existing word embedding approaches and analyze them with regard to many classification tasks.
Traditional approaches mostly use matrix factorization to produce word representations, and they are not able to capture the semantic and syntactic regularities of the language very well.
On the other hand, neural-network-based approaches can capture sophisticated regularities of the language and preserve the word relationships in the generated word representations.
arXiv Detail & Related papers (2023-03-13T15:34:19Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Compositional Temporal Grounding with Structured Variational Cross-Graph
Correspondence Learning [92.07643510310766]
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We empirically find that state-of-the-art methods fail to generalize to queries with novel combinations of seen words.
We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
arXiv Detail & Related papers (2022-03-24T12:55:23Z)
- Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build a visual-semantics-aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts: first, we apply the constructed hierarchy to the object recognition task; then, we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z)
- Plurality and Quantification in Graph Representation of Meaning [4.82512586077023]
Our graph language covers the essentials of natural language semantics using only monadic second-order variables.
We present a unification-based mechanism for constructing semantic graphs at a simple syntax-semantics interface.
The present graph formalism is applied to linguistic issues in distributive predication, cross-categorial conjunction, and scope permutation of quantificational expressions.
arXiv Detail & Related papers (2021-12-13T07:04:41Z)
- Learning to Generalize Compositionally by Transferring Across Semantic
Parsing Tasks [37.66114618645146]
We investigate learning representations that facilitate transfer learning from one compositional task to another.
We apply this method to semantic parsing, using three very different datasets.
Our method significantly improves compositional generalization over baselines on the test set of the target task.
arXiv Detail & Related papers (2021-11-09T09:10:21Z)
- Transformer-based Dual Relation Graph for Multi-label Image Recognition [56.12543717723385]
We propose a novel Transformer-based Dual Relation learning framework.
We explore two aspects of correlation, i.e., structural relation graph and semantic relation graph.
Our approach achieves new state-of-the-art results on two popular multi-label recognition benchmarks.
arXiv Detail & Related papers (2021-10-10T07:14:52Z)
- Adaptive Hierarchical Graph Reasoning with Semantic Coherence for
Video-and-Language Inference [81.50675020698662]
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.
We propose an adaptive hierarchical graph network that achieves in-depth understanding of the video over complex interactions.
We introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network from three hierarchies.
arXiv Detail & Related papers (2021-07-26T15:23:19Z)
- A Benchmark for Systematic Generalization in Grounded Language
Understanding [61.432407738682635]
Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts.
Modern neural networks, by contrast, struggle to interpret novel compositions.
We introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding.
arXiv Detail & Related papers (2020-03-11T08:40:15Z)