Compositional Temporal Grounding with Structured Variational Cross-Graph
Correspondence Learning
- URL: http://arxiv.org/abs/2203.13049v2
- Date: Mon, 28 Mar 2022 14:22:18 GMT
- Title: Compositional Temporal Grounding with Structured Variational Cross-Graph
Correspondence Learning
- Authors: Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu,
Yi Yang, Yueting Zhuang, Xin Eric Wang
- Abstract summary: Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We empirically find that state-of-the-art methods fail to generalize to queries with novel combinations of seen words.
We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
- Score: 92.07643510310766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal grounding in videos aims to localize one target video segment that
semantically corresponds to a given query sentence. Thanks to the semantic
diversity of natural language descriptions, temporal grounding allows activity
grounding beyond pre-defined classes and has received increasing attention in
recent years. The semantic diversity is rooted in the principle of
compositionality in linguistics, where novel semantics can be systematically
described by combining known words in novel ways (compositional
generalization). However, current temporal grounding datasets do not
specifically test for compositional generalizability. To systematically
measure the compositional generalizability of temporal grounding models, we
introduce a new Compositional Temporal Grounding task and construct two new
dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating the
state-of-the-art methods on our new dataset splits, we empirically find that
they fail to generalize to queries with novel combinations of seen words. To
tackle this challenge, we propose a variational cross-graph reasoning framework
that explicitly decomposes video and language into multiple structured
hierarchies and learns fine-grained semantic correspondence among them.
Experiments illustrate the superior compositional generalizability of our
approach. The repository of this work is at https://github.com/YYJMJC/Compositional-Temporal-Grounding.
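For readers new to the task, the sketch below shows the standard way temporal grounding predictions are scored, Recall@1 at a temporal-IoU threshold, which is also how compositional splits such as Charades-CG and ActivityNet-CG are typically evaluated on queries with novel word combinations. It is a minimal, self-contained illustration; the function names and toy numbers are assumptions, not code from the authors' repository.

```python
# Illustrative sketch (not from the paper's repo): temporal IoU and Recall@1 evaluation.
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two temporal segments given as (start, end) seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds: List[Tuple[float, float]],
                gts: List[Tuple[float, float]],
                iou_threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy example: two queries from a held-out novel-composition split.
predictions = [(3.2, 9.8), (14.0, 20.5)]   # model's top-1 (start, end) per query
ground_truth = [(4.0, 10.0), (2.0, 7.5)]
print(recall_at_1(predictions, ground_truth, iou_threshold=0.5))  # -> 0.5
```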
Related papers
- SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding [52.98133831401225]
Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence.
We propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo.
We introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries.
arXiv Detail & Related papers (2024-07-06T16:08:17Z)
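The coarse-to-fine saliency ranking described above can be pictured as a margin-based objective in which the positive query must outrank a subtle (fine-grained) negative, which in turn must outrank a clearly unrelated (coarse) negative. The sketch below is only an assumed illustration of that ordering; the margins, similarity inputs, and function name are not SHINE's actual formulation.

```python
# Illustrative sketch (not SHINE's loss): hierarchical margin ranking over negatives.
import torch
import torch.nn.functional as F

def hierarchical_ranking_loss(sim_pos: torch.Tensor,
                              sim_fine_neg: torch.Tensor,
                              sim_coarse_neg: torch.Tensor,
                              margin_fine: float = 0.1,
                              margin_coarse: float = 0.2) -> torch.Tensor:
    """Enforce sim(video, positive query) > sim(video, fine-grained negative)
    > sim(video, coarse negative). Inputs are (batch,) similarity scores."""
    loss_fine = F.relu(margin_fine - (sim_pos - sim_fine_neg))            # positive above subtle negative
    loss_coarse = F.relu(margin_coarse - (sim_fine_neg - sim_coarse_neg))  # subtle negative above unrelated one
    return (loss_fine + loss_coarse).mean()

# Toy usage with random similarities for a batch of 4 video-query triples.
torch.manual_seed(0)
sim_p, sim_f, sim_c = torch.rand(4), torch.rand(4), torch.rand(4)
print(hierarchical_ranking_loss(sim_p, sim_f, sim_c))
```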
- What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level semantics.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
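A generic way to combine a fine-grained local term with a higher-level global term, in the spirit of the summary above, is sketched below: word-region alignment for the local part and a video-sentence InfoNCE loss for the global part. This is an assumption-laden illustration, not the paper's self-supervised method; all shapes and weights are made up for the example.

```python
# Illustrative sketch (not the paper's method): local + global representation objectives.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def local_global_loss(region_feats, word_feats, video_feat, sentence_feat, alpha=0.5):
    """Local term: align each word with its best-matching region (max over regions).
    Global term: contrast pooled video and sentence embeddings across the batch."""
    # region_feats: (B, R, D), word_feats: (B, W, D), video_feat/sentence_feat: (B, D)
    sim = torch.einsum('brd,bwd->brw', F.normalize(region_feats, dim=-1),
                       F.normalize(word_feats, dim=-1))
    local = (1.0 - sim.max(dim=1).values).mean()   # encourage a matching region per word
    global_ = info_nce(video_feat, sentence_feat)
    return alpha * local + (1 - alpha) * global_

B, R, W, D = 2, 6, 5, 32
loss = local_global_loss(torch.randn(B, R, D), torch.randn(B, W, D),
                         torch.randn(B, D), torch.randn(B, D))
print(loss)
```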
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the inherent structured semantics inside the videos and language is the crucial factor to achieve compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
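The shared idea of cross-graph correspondence, aligning nodes of a language hierarchy with nodes of a video hierarchy, can be sketched as soft attention between two sets of node embeddings. The snippet below assumes both hierarchies are already encoded, omits the variational component and the graph structure itself, and is not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): soft cross-graph node correspondence.
import torch
import torch.nn.functional as F

def cross_graph_correspondence(lang_nodes: torch.Tensor,
                               video_nodes: torch.Tensor,
                               temperature: float = 0.1):
    """Soft alignment between nodes of a language hierarchy (L, D) and nodes of a
    video hierarchy (V, D): each language node attends over video nodes, and the
    matching score averages the attended similarities."""
    lang = F.normalize(lang_nodes, dim=-1)
    video = F.normalize(video_nodes, dim=-1)
    sim = lang @ video.t()                        # (L, V) node-to-node similarities
    attn = F.softmax(sim / temperature, dim=-1)   # soft correspondence per language node
    score = (attn * sim).sum(dim=-1).mean()       # expected similarity under the alignment
    return score, attn

# Toy hierarchies: 4 language nodes (words/phrases/sentence) vs. 6 video nodes (clips/events).
score, attn = cross_graph_correspondence(torch.randn(4, 64), torch.randn(6, 64))
print(score, attn.shape)  # scalar matching score, (4, 6) alignment matrix
```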
- Learning to Generalize Compositionally by Transferring Across Semantic Parsing Tasks [37.66114618645146]
We investigate learning representations that facilitate transfer learning from one compositional task to another.
We apply this method to semantic parsing, using three very different datasets.
Our method significantly improves compositional generalization over baselines on the test set of the target task.
arXiv Detail & Related papers (2021-11-09T09:10:21Z)
- A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings [42.87769996249732]
We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings.
The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy.
arXiv Detail & Related papers (2020-08-11T13:54:11Z)
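As a minimal picture of what topic-dependent word embeddings mean, the sketch below gives each word a shared vector plus a topic-specific offset, so the same word receives different vectors under different topics. This parameterization is purely illustrative and is not the paper's generative model, which learns topics jointly from local and global context.

```python
# Illustrative sketch (not the paper's model): topic-dependent word vectors.
import torch
import torch.nn as nn

class TopicWordEmbedding(nn.Module):
    """Shared word vector plus a topic-specific offset, so the same word maps to
    different vectors under different topics (one simple way to model polysemy)."""
    def __init__(self, vocab_size: int, num_topics: int, dim: int):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.topic_offset = nn.Embedding(num_topics * vocab_size, dim)
        self.vocab_size = vocab_size

    def forward(self, word_ids: torch.Tensor, topic_ids: torch.Tensor) -> torch.Tensor:
        return self.word(word_ids) + self.topic_offset(topic_ids * self.vocab_size + word_ids)

emb = TopicWordEmbedding(vocab_size=1000, num_topics=20, dim=64)
word = torch.tensor([42, 42])    # same word ...
topic = torch.tensor([3, 7])     # ... under two different topics
vecs = emb(word, topic)
print(torch.allclose(vecs[0], vecs[1]))  # False: topic-dependent vectors differ
```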
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
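A bare-bones version of regression-based temporal grounding, predicting a normalized (start, end) interval from query-conditioned video features, is sketched below. It is a toy stand-in, not the paper's local-to-global architecture with mid-level phrase features; the module name and dimensions are assumptions.

```python
# Illustrative sketch (not the paper's architecture): direct interval regression.
import torch
import torch.nn as nn

class IntervalRegressor(nn.Module):
    """Toy regression head: fuse a query feature with attention-pooled video
    features and predict a normalized (start, end) interval in [0, 1]."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.attn = nn.Linear(dim * 2, 1)
        self.head = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, clip_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # clip_feats: (T, dim) per-clip features; query_feat: (dim,) sentence/phrase feature
        q = query_feat.unsqueeze(0).expand(clip_feats.size(0), -1)
        pair = torch.cat([clip_feats, q], dim=-1)                       # (T, 2*dim)
        weights = torch.softmax(self.attn(pair).squeeze(-1), dim=0)     # query-conditioned pooling
        pooled = (weights.unsqueeze(-1) * clip_feats).sum(dim=0)        # (dim,)
        bounds = torch.sigmoid(self.head(torch.cat([pooled, query_feat])))  # (2,) in [0, 1]
        start, end = bounds.min(), bounds.max()                         # keep start <= end
        return torch.stack([start, end])

model = IntervalRegressor(dim=128)
pred = model(torch.randn(40, 128), torch.randn(128))
print(pred)  # normalized (start, end) of the predicted interval
```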
- A Benchmark for Systematic Generalization in Grounded Language Understanding [61.432407738682635]
Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts.
Modern neural networks, by contrast, struggle to interpret novel compositions.
We introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding.
arXiv Detail & Related papers (2020-03-11T08:40:15Z)