Unifying Latent and Lexicon Representations for Effective Video-Text
Retrieval
- URL: http://arxiv.org/abs/2402.16769v1
- Date: Mon, 26 Feb 2024 17:36:50 GMT
- Title: Unifying Latent and Lexicon Representations for Effective Video-Text
Retrieval
- Authors: Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang
Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
- Abstract summary: We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvements on MSR-VTT and DiDeMo, respectively.
- Score: 87.69394953339238
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In video-text retrieval, most existing methods adopt the dual-encoder
architecture for fast retrieval, which employs two individual encoders to
extract global latent representations for videos and texts. However, they face
challenges in capturing fine-grained semantic concepts. In this work, we
propose the UNIFY framework, which learns lexicon representations to capture
fine-grained semantics and combines the strengths of latent and lexicon
representations for video-text retrieval. Specifically, we map videos and texts
into a pre-defined lexicon space, where each dimension corresponds to a
semantic concept. A two-stage semantics grounding approach is proposed to
activate semantically relevant dimensions and suppress irrelevant dimensions.
The learned lexicon representations can thus reflect fine-grained semantics of
videos and texts. Furthermore, to leverage the complementarity between latent
and lexicon representations, we propose a unified learning scheme to facilitate
mutual learning via structure sharing and self-distillation. Experimental
results show our UNIFY framework largely outperforms previous video-text
retrieval methods, with 4.8% and 8.2% Recall@1 improvements on MSR-VTT and
DiDeMo, respectively.
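Since the abstract describes the approach only at a high level, here is a minimal, illustrative PyTorch sketch of how a lexicon branch and a latent branch can be combined for retrieval. The log(1+ReLU) activation with max-pooling over tokens is a common lexicon-retrieval recipe (SPLADE-style) assumed here for concreteness, not necessarily UNIFY's exact design; the fusion weight alpha, the temperature tau, and the symmetric self-distillation form are likewise illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def lexicon_repr(token_logits):
    """Map per-token logits over a pre-defined lexicon (vocabulary) to a single
    sparse lexicon vector: saturate with log(1 + ReLU), then max-pool over tokens.
    This is a common lexicon-retrieval recipe (SPLADE-style), used here only as
    an illustrative stand-in for the paper's lexicon encoder."""
    # token_logits: (batch, num_tokens, vocab_size)
    return torch.log1p(F.relu(token_logits)).max(dim=1).values  # (batch, vocab_size)

def similarity_matrices(video_feat, text_feat, video_logits, text_logits):
    """Compute latent and lexicon video-text similarity matrices."""
    # Latent branch: cosine similarity between global embeddings.
    v_lat = F.normalize(video_feat, dim=-1)                 # (B, d)
    t_lat = F.normalize(text_feat, dim=-1)                  # (B, d)
    sim_latent = v_lat @ t_lat.t()                          # (B, B)

    # Lexicon branch: dot product of sparse lexicon vectors, where each
    # dimension corresponds to one semantic concept in the lexicon.
    v_lex = lexicon_repr(video_logits)                      # (B, V)
    t_lex = lexicon_repr(text_logits)                       # (B, V)
    sim_lexicon = v_lex @ t_lex.t()                         # (B, B)
    return sim_latent, sim_lexicon

def unified_losses(sim_latent, sim_lexicon, alpha=0.5, tau=0.05):
    """Contrastive loss on each branch plus a symmetric self-distillation term
    that lets the two branches teach each other. alpha and tau are hypothetical
    hyper-parameters, not values from the paper."""
    labels = torch.arange(sim_latent.size(0), device=sim_latent.device)
    ce = lambda s: 0.5 * (F.cross_entropy(s / tau, labels) +
                          F.cross_entropy(s.t() / tau, labels))
    contrastive = ce(sim_latent) + ce(sim_lexicon)

    # Self-distillation: KL between the two branches' retrieval distributions,
    # detaching each teacher so gradients only flow into the student branch.
    log_p_lat = F.log_softmax(sim_latent / tau, dim=-1)
    log_p_lex = F.log_softmax(sim_lexicon / tau, dim=-1)
    distill = (F.kl_div(log_p_lat, log_p_lex.detach().exp(), reduction="batchmean") +
               F.kl_div(log_p_lex, log_p_lat.detach().exp(), reduction="batchmean"))

    # Fused score used at retrieval time.
    sim_fused = alpha * sim_latent + (1 - alpha) * sim_lexicon
    return contrastive + distill, sim_fused
```

At retrieval time only sim_fused would be needed; the two-stage semantics grounding described above (activating semantically relevant lexicon dimensions and suppressing irrelevant ones) would live inside whatever encoders produce video_logits and text_logits.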
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets show that EMCL learns more discriminative video-and-language representations (a minimal sketch of the EM idea appears after this list).
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
- Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into common spaces for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Dual Encoding for Video Retrieval by Text [49.34356217787656]
We propose a dual deep encoding network that encodes videos and queries into powerful dense representations of their own.
Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding.
Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning.
arXiv Detail & Related papers (2020-09-10T15:49:39Z)
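The EMCL entry above mentions using the Expectation-Maximization algorithm to find a compact set of bases for the latent space. Below is a minimal sketch of that general EM idea, under stated assumptions: soft assignments with a temperature, random initialization from the batch, and a fixed number of iterations; the basis count and normalization are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def em_compact_bases(feats, num_bases=32, iters=3, tau=0.05):
    """EM-style soft clustering: estimate a compact set of bases for a batch of
    latent features, then reconstruct the features from those bases.
    feats: (N, d) video or text latent features."""
    feats = F.normalize(feats, dim=-1)
    # Initialize bases from a random subset of the features (an assumption).
    idx = torch.randperm(feats.size(0))[:num_bases]
    bases = feats[idx].clone()                                # (K, d)

    for _ in range(iters):
        # E-step: soft assignment (responsibility) of each feature to each basis.
        resp = F.softmax(feats @ bases.t() / tau, dim=-1)     # (N, K)
        # M-step: update each basis as the responsibility-weighted mean.
        bases = F.normalize(resp.t() @ feats, dim=-1)         # (K, d)

    # Final E-step with the converged bases, then reconstruct the features
    # inside the compact subspace they span.
    resp = F.softmax(feats @ bases.t() / tau, dim=-1)
    compact = F.normalize(resp @ bases, dim=-1)               # (N, d)
    return compact, bases
```

In a retrieval pipeline, the reconstructed compact features would stand in for the raw latent features when computing video-text similarities, which is what makes the learned representations compact.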
This list is automatically generated from the titles and abstracts of the papers on this site.