Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval
- URL: http://arxiv.org/abs/2110.15609v1
- Date: Fri, 29 Oct 2021 08:23:40 GMT
- Title: Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval
- Authors: Ning Han, Jingjing Chen, Guangyi Xiao, Yawen Zeng, Chuhao Shi, Hao
Chen
- Abstract summary: Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced Network (VSR-Net), a novel cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
- Score: 17.443195531553474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of cross-modal retrieval between texts and videos aims to understand
the correspondence between vision and language. Existing studies follow a trend
of measuring text-video similarity on the basis of textual and video
embeddings. In common practice, video representations are constructed either by feeding video frames into a 2D/3D CNN for global visual feature extraction, or by learning only simple semantic relations among local-level fine-grained frame regions via a graph convolutional network. However, these representations do not fully exploit the spatio-temporal relations among visual components, and thus cannot distinguish videos with the same visual components but different relations. To solve
this problem, we propose a Visual Spatio-temporal Relation-enhanced Network
(VSR-Net), a novel cross-modal retrieval framework that enhances visual
representation with spatio-temporal relations among components. Specifically,
visual spatio-temporal relations are encoded using a multi-layer
spatio-temporal transformer to learn visual relational features. We combine fine-grained local relational features with global features to bridge the text-video modalities. Extensive experiments are conducted on both the MSR-VTT and MSVD
datasets. The results demonstrate the effectiveness of our proposed model.
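To make the pipeline concrete, here is a minimal PyTorch sketch of a spatio-temporal relation encoder in the spirit of VSR-Net. The abstract does not specify dimensions, position encodings, pooling, or the fusion scheme, so every such detail below (learned time/space embeddings, mean pooling, additive fusion) is an assumption for illustration, not the paper's implementation.

```python
# Hypothetical VSR-Net-style video encoder (details assumed, not taken
# from the paper): region features from each frame are flattened into one
# token sequence, a multi-layer transformer models spatio-temporal
# relations among them, and the pooled relational feature is fused with a
# global video feature before cosine matching against a text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalRelationEncoder(nn.Module):
    def __init__(self, dim=512, frames=8, regions=16, layers=4, heads=8):
        super().__init__()
        # Learned position embeddings mark which frame (time) and which
        # region (space) each token comes from.
        self.time_pos = nn.Parameter(torch.zeros(frames, 1, dim))
        self.space_pos = nn.Parameter(torch.zeros(1, regions, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, region_feats, global_feat):
        # region_feats: (B, frames, regions, dim); global_feat: (B, dim)
        b, t, r, d = region_feats.shape
        tokens = region_feats + self.time_pos + self.space_pos
        tokens = tokens.reshape(b, t * r, d)
        relational = self.transformer(tokens).mean(dim=1)  # pooled relations
        # Fuse local relational and global cues (additive fusion assumed).
        return F.normalize(relational + global_feat, dim=-1)

def similarity(video_emb, text_emb):
    # Cosine similarity matrix used to rank videos against text queries.
    return video_emb @ F.normalize(text_emb, dim=-1).t()
```

In this sketch, retrieval reduces to ranking the rows or columns of the similarity matrix; the choice of mean pooling over tokens is a placeholder for whatever aggregation the paper actually uses.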
Related papers
- Video-Language Alignment via Spatio-Temporal Graph Transformer [26.109883502401885]
Video-language alignment is a crucial task that benefits various downstream applications, e.g., video-text retrieval and question answering.
We propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training.
arXiv Detail & Related papers (2024-07-16T12:52:32Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence).
We introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm. A minimal window-scoring sketch in this spirit appears after this list.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote the learning of region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer an image-language pre-trained model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks. A baseline frame-pooling sketch of this transfer appears after this list.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed to exploit the long-range spatial and temporal context interdependencies of such features and their spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
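For the zero-shot VMR entry above, the following is a hedged sketch of how frozen-VLM priors can score temporal moments: per-frame embeddings from any frozen VLM (e.g., CLIP) are compared against the query embedding, and sliding windows are ranked by their mean similarity. The window enumeration and mean-pooling heuristic are illustrative assumptions, not the paper's algorithm.

```python
# Assumed zero-shot moment-scoring sketch: rank sliding temporal windows
# by the mean cosine similarity between frozen per-frame embeddings and
# the frozen text-query embedding.
import torch
import torch.nn.functional as F

def best_moment(frame_embs, text_emb, min_len=2, max_len=16):
    # frame_embs: (T, dim) frozen per-frame features; text_emb: (dim,)
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    best, best_span = float("-inf"), (0, 0)
    t = sims.shape[0]
    for start in range(t):
        # Enumerate windows of length min_len..max_len, capped at video end.
        for end in range(start + min_len, min(start + max_len, t) + 1):
            score = sims[start:end].mean().item()  # window relevance
            if score > best:
                best, best_span = score, (start, end)
    return best_span, best  # (start_frame, end_frame), relevance score
```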
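For the CLIP2Video entry, the sketch below shows only the simplest baseline for transferring an image-text model to video-text retrieval: encode each frame with the image encoder, mean-pool over time, and rank by cosine similarity. CLIP2Video itself adds temporal-difference and temporal-alignment blocks that are not reproduced here.

```python
# Baseline frame-pooling transfer of an image-text model to video-text
# retrieval (a simplification, not CLIP2Video's full architecture).
import torch
import torch.nn.functional as F

def video_text_scores(frame_embs, text_embs):
    # frame_embs: (num_videos, T, dim) image-encoder outputs per frame
    # text_embs:  (num_queries, dim) text-encoder outputs
    video_embs = F.normalize(frame_embs.mean(dim=1), dim=-1)  # pool frames
    text_embs = F.normalize(text_embs, dim=-1)
    return text_embs @ video_embs.t()  # (num_queries, num_videos) cosines

# Ranking: for each query, argsort its row of the score matrix descending.
```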