Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
- URL: http://arxiv.org/abs/2302.09473v2
- Date: Tue, 17 Oct 2023 22:01:00 GMT
- Title: Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
- Authors: Yimu Wang, Peng Shi
- Abstract summary: We present a novel multi-grained sparse learning framework, S3MA, to learn an aligned sparse space shared between the video and the text for video-text retrieval.
With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses.
Benefiting from the learned shared sparse space and multi-grained similarities, experiments on several video-text retrieval benchmarks demonstrate the superiority of S3MA over existing methods.
- Score: 22.17732989393653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent progress in video-text retrieval has been advanced by the
exploration of better representation learning, in this paper, we present a
novel multi-grained sparse learning framework, S3MA, to learn an aligned sparse
space shared between the video and the text for video-text retrieval. The
shared sparse space is initialized with a finite number of sparse concepts,
each of which refers to a number of words. With the text data at hand, we learn
and update the shared sparse space in a supervised manner using the proposed
similarity and alignment losses. Moreover, to enable multi-grained alignment,
we incorporate frame representations for better modeling the video modality and
calculating fine-grained and coarse-grained similarities. Benefiting from the
learned shared sparse space and multi-grained similarities, extensive
experiments on several video-text retrieval benchmarks demonstrate the
superiority of S3MA over existing methods. Our code is available at
https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
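As a rough illustration of the idea described in the abstract (a shared bank of sparse concepts plus coarse-grained and fine-grained similarities), here is a minimal PyTorch sketch. It is not the authors' implementation (see the linked repository for that); the module name, shapes, top-k sparsification, and the way the two similarity terms are combined are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): video and text features are projected
# onto a shared bank of sparse "concepts", and retrieval similarity combines a
# coarse-grained (video-level) term with a fine-grained (frame-level) term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMultiGrainedSim(nn.Module):
    def __init__(self, dim: int = 512, num_concepts: int = 1024, topk: int = 32):
        super().__init__()
        # Shared sparse space: a finite bank of concept embeddings.
        self.concepts = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)
        self.topk = topk

    def to_sparse(self, x: torch.Tensor) -> torch.Tensor:
        """Project features onto the concept bank and keep only the top-k
        activations, yielding a sparse concept-space representation."""
        logits = x @ F.normalize(self.concepts, dim=-1).t()          # (..., num_concepts)
        vals, idx = logits.topk(self.topk, dim=-1)
        sparse = torch.zeros_like(logits).scatter_(-1, idx, F.relu(vals))
        return F.normalize(sparse, dim=-1)

    def forward(self, video_feat, frame_feats, text_feat):
        # video_feat: (B, D) pooled video; frame_feats: (B, T, D); text_feat: (B, D)
        v_sp = self.to_sparse(F.normalize(video_feat, dim=-1))       # (B, C)
        t_sp = self.to_sparse(F.normalize(text_feat, dim=-1))        # (B, C)
        f_sp = self.to_sparse(F.normalize(frame_feats, dim=-1))      # (B, T, C)

        coarse = v_sp @ t_sp.t()                                     # (B, B) video-text
        # Fine-grained: for each text, take the best-matching frame of each video.
        fine = torch.einsum("btc,kc->bkt", f_sp, t_sp).max(dim=-1).values  # (B, B)
        return coarse + fine                                         # combined similarity
```

In a full system, the returned similarity matrix would feed a symmetric contrastive loss (e.g., InfoNCE) with the concept bank trained jointly, which is one plausible way to realize the supervised similarity and alignment objectives described above.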
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable with the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net)
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z)
- Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space (a generic sketch of this step appears after this list).
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity.
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
arXiv Detail & Related papers (2022-10-13T10:11:41Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence [8.602553195689513]
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other.
A single space is not enough to accommodate various videos and sentences.
We propose a novel framework that maps instances into multiple individual embedding spaces.
arXiv Detail & Related papers (2020-04-16T21:12:32Z)
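As referenced in the EMCL entry above, a compact set of bases for a latent space can be estimated with an EM-style procedure: an E-step that softly assigns features to bases and an M-step that re-estimates each basis as a responsibility-weighted average. The sketch below is a generic, hypothetical illustration of that idea, not the EMCL authors' code; the function name, initialization, and hyperparameters are assumptions.

```python
# Generic EM-style basis estimation sketch (illustrative only).
import torch
import torch.nn.functional as F

def em_bases(features: torch.Tensor, num_bases: int = 32, iters: int = 4,
             temperature: float = 0.05) -> torch.Tensor:
    """features: (N, D) latent vectors; returns (num_bases, D) compact bases."""
    n, _ = features.shape
    bases = features[torch.randperm(n)[:num_bases]].clone()          # init from data
    x = F.normalize(features, dim=-1)
    for _ in range(iters):
        # E-step: soft responsibilities of each feature for each basis.
        resp = F.softmax(x @ F.normalize(bases, dim=-1).t() / temperature, dim=-1)   # (N, K)
        # M-step: bases become responsibility-weighted means of the features.
        bases = (resp.t() @ features) / (resp.sum(dim=0, keepdim=True).t() + 1e-6)   # (K, D)
    return bases

if __name__ == "__main__":
    feats = torch.randn(256, 512)                                    # dummy latent features
    bases = em_bases(feats)                                          # (32, 512)
    # Re-express features with the compact bases (soft reconstruction).
    coeffs = F.softmax(F.normalize(feats, dim=-1) @ F.normalize(bases, dim=-1).t() / 0.05, dim=-1)
    compact = coeffs @ bases                                         # (256, 512)
```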