Exploiting Visual Semantic Reasoning for Video-Text Retrieval
- URL: http://arxiv.org/abs/2006.08889v1
- Date: Tue, 16 Jun 2020 02:56:46 GMT
- Title: Exploiting Visual Semantic Reasoning for Video-Text Retrieval
- Authors: Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li
- Abstract summary: We propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions.
We perform reasoning with novel random-walk rule-based graph convolutional networks to generate region features enriched with semantic relations.
With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed.
- Score: 14.466809435818984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video retrieval is a challenging research topic bridging the vision and
language areas and has attracted broad attention in recent years. Previous
works have been devoted to representing videos by directly encoding from
frame-level features. In fact, videos contain various and abundant semantic
relations to which existing methods pay little attention. To address this issue,
we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit
reasoning between frame regions. Specifically, we consider frame regions as
vertices and construct a fully-connected semantic correlation graph. Then, we
perform reasoning with novel random-walk rule-based graph convolutional networks
to generate region features enriched with semantic relations. With the benefit
of reasoning, semantic interactions between regions are considered, while the
impact of redundancy is suppressed. Finally, the region features are aggregated
to form frame-level features for further encoding to measure video-text
similarity. Extensive experiments on two public benchmark datasets validate the
effectiveness of our method by achieving state-of-the-art performance due to
the powerful semantic reasoning.
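To make the pipeline described in the abstract concrete, here is a minimal sketch of the reasoning step: the regions of one frame form a fully-connected affinity graph, a row-normalized (random-walk-style) graph convolution updates the region features, and the result is pooled into a frame-level feature. The dot-product affinity, softmax normalization, single layer, residual connection, and mean pooling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the reasoning step described above (not the authors' code).
# Assumptions: pairwise dot-product affinities define the fully-connected
# semantic correlation graph, a row-wise softmax turns it into a random-walk
# transition matrix, one graph-convolution step with a residual connection
# performs the reasoning, and mean pooling aggregates regions into a frame
# feature. Dimensions and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionReasoning(nn.Module):
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.phi = nn.Linear(dim, dim // 2)    # affinity embedding (query side)
        self.theta = nn.Linear(dim, dim // 2)  # affinity embedding (key side)
        self.proj = nn.Linear(dim, dim)        # graph-convolution weight

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, dim) features of a single frame
        affinity = self.phi(regions) @ self.theta(regions).t()        # (R, R) graph
        transition = F.softmax(affinity, dim=-1)                      # rows sum to 1
        reasoned = regions + F.relu(self.proj(transition @ regions))  # one GCN step
        return reasoned.mean(dim=0)                                   # frame-level feature

frame_regions = torch.randn(36, 2048)   # e.g. 36 detected regions in one frame
frame_feature = RegionReasoning()(frame_regions)
print(frame_feature.shape)              # torch.Size([2048])
```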
Related papers
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
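As a rough illustration of the gated aggregation described above (a semantic graph over object pairs whose neighbors are selectively mixed into each node), the sketch below applies a sigmoid gate to a one-step neighbor aggregation; the gating formulation, adjacency construction, and dimensions are assumptions, not the paper's implementation.

```python
# Rough sketch of gated neighbor aggregation over a semantic graph of object
# regions (not the paper's code). The sigmoid gate, feature sizes, and the
# stand-in adjacency matrix are illustrative assumptions.
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.message = nn.Linear(dim, dim)    # transforms neighbor features
        self.gate = nn.Linear(2 * dim, dim)   # decides how much to take in

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) object features; adj: (N, N) semantic adjacency
        neighbors = adj @ self.message(nodes)  # aggregate local neighbors
        gate = torch.sigmoid(self.gate(torch.cat([nodes, neighbors], dim=-1)))
        return nodes + gate * neighbors        # gated residual update

objects = torch.randn(10, 1024)                    # 10 detected objects
adj = torch.softmax(torch.randn(10, 10), dim=-1)   # stand-in semantic graph
updated = GatedGCNLayer()(objects, adj)
print(updated.shape)                               # torch.Size([10, 1024])
```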
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
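As a rough illustration of the bilateral, iterative attention mentioned in the FIAN entry above, the sketch below lets the sentence query and the video features attend to each other for a few rounds; the use of standard multi-head attention, residual updates, and the iteration count are assumptions, not FIAN's actual design.

```python
# Illustrative sketch of iterative bilateral attention between a sentence
# query and video features (not FIAN's actual implementation). Standard
# multi-head attention and the iteration count are assumptions.
import torch
import torch.nn as nn

class BilateralIterativeAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, iters: int = 3):
        super().__init__()
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, query: torch.Tensor, video: torch.Tensor):
        # query: (B, num_words, dim); video: (B, num_clips, dim)
        for _ in range(self.iters):
            # query attends to video, then video attends to the updated query
            query = query + self.q2v(query, video, video)[0]
            video = video + self.v2q(video, query, query)[0]
        return query, video

q, v = torch.randn(2, 12, 512), torch.randn(2, 64, 512)
q_out, v_out = BilateralIterativeAttention()(q, v)
print(q_out.shape, v_out.shape)  # torch.Size([2, 12, 512]) torch.Size([2, 64, 512])
```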
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.