Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos
- URL: http://arxiv.org/abs/2303.01046v1
- Date: Thu, 2 Mar 2023 08:00:22 GMT
- Title: Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos
- Authors: Daizong Liu, Pan Zhou
- Abstract summary: We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
- Score: 67.12603318660689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence localization in videos (TSLV) aims to retrieve the segment of most
interest from an untrimmed video according to a given sentence query.
However, almost all existing TSLV approaches suffer from the same limitations:
(1) they focus on either frame-level or object-level visual representation
learning and the corresponding correlation reasoning, but fail to integrate the
two; (2) they neglect to leverage rich semantic contexts to further
benefit query reasoning. To address these issues, in this paper, we propose
a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN),
which enables both visual- and semantic-aware query reasoning from object-level
to frame-level. Specifically, we present a new graph memory mechanism to
perform visual-semantic query reasoning: for visual reasoning, we design a
visual graph memory to leverage the visual information of the video; for semantic
reasoning, a semantic graph memory is introduced to explicitly leverage the
semantic knowledge contained in the classes and attributes of video objects
and to perform correlation reasoning in the semantic space. Experiments on three
datasets demonstrate that our HVSARN achieves a new state-of-the-art
performance.
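The abstract only outlines the graph memory mechanism at a high level. As a rough illustration, the following is a minimal PyTorch sketch of what a single graph memory read-and-update step over visual or semantic node features might look like; the module name, dimensions, attention-based read, and gated write are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a graph memory read/update step (not the authors' code).
# Assumptions: node features (object- or frame-level) attend over a set of learnable
# memory slots, and the read result is fused back into each node with a gated write.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMemory(nn.Module):
    def __init__(self, feat_dim: int, num_slots: int = 8):
        super().__init__()
        # Learnable memory slots shared across the video.
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        self.read_proj = nn.Linear(feat_dim, feat_dim)
        self.write_gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        """nodes: (batch, num_nodes, feat_dim) visual or semantic node features."""
        # Read: each node attends over the memory slots.
        attn = torch.einsum("bnd,md->bnm", self.read_proj(nodes), self.memory)
        attn = F.softmax(attn / nodes.size(-1) ** 0.5, dim=-1)
        read = torch.einsum("bnm,md->bnd", attn, self.memory)

        # Write back: gate how much of the read content is fused into each node.
        gate = torch.sigmoid(self.write_gate(torch.cat([nodes, read], dim=-1)))
        return gate * read + (1.0 - gate) * nodes


if __name__ == "__main__":
    # Example: 16 object-level nodes with 256-d features for a batch of 2 clips.
    mem = GraphMemory(feat_dim=256, num_slots=8)
    updated = mem(torch.randn(2, 16, 256))
    print(updated.shape)  # torch.Size([2, 16, 256])
```

In the paper's hierarchical setting, one such memory would operate on object-level nodes and another on semantic (class/attribute) nodes before aggregation to frame level; the sketch above covers only the generic read/write pattern.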
Related papers
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic alignment model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z) - Consensus Graph Representation Learning for Better Grounded Image
Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z) - Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z) - Object-Centric Representation Learning for Video Question Answering [27.979053252431306]
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects.
arXiv Detail & Related papers (2021-04-12T02:37:20Z) - Exploiting Visual Semantic Reasoning for Video-Text Retrieval [14.466809435818984]
We propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions.
We perform reasoning with novel random-walk rule-based graph convolutional networks to generate region features enriched with semantic relations.
With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed.
arXiv Detail & Related papers (2020-06-16T02:56:46Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z) - Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose Visual Semantic Parsing, VSPNet, and a graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)