Towards Efficient and Effective Text-to-Video Retrieval with
Coarse-to-Fine Visual Representation Learning
- URL: http://arxiv.org/abs/2401.00701v1
- Date: Mon, 1 Jan 2024 08:54:18 GMT
- Title: Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
- Authors: Kaibin Tian and Yanhua Cheng and Yi Liu and Xinglin Hou and Quan Chen and Han Li
- Abstract summary: We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of the top-k candidates, which are then reranked with fine-grained video representations.
- Score: 15.998149438353133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, text-to-video retrieval methods based on CLIP have
experienced rapid development. The primary direction of evolution is to exploit
a much wider gamut of visual and textual cues to achieve alignment.
Concretely, those methods with impressive performance often design a heavy
fusion block for sentence (words)-video (frames) interaction, despite the
prohibitive computational complexity. Nevertheless, these approaches are not
optimal in terms of feature utilization and retrieval efficiency. To address
this issue, we adopt multi-granularity visual feature learning, ensuring the
model's comprehensiveness in capturing visual content features spanning from
abstract to detailed levels during the training phase. To better leverage the
multi-granularity features, we devise a two-stage retrieval architecture in the
retrieval phase. This solution ingeniously balances the coarse and fine
granularity of retrieval content. Moreover, it also strikes a harmonious
equilibrium between retrieval effectiveness and efficiency. Specifically, in
the training phase, we design a parameter-free text-gated interaction block (TIB)
for fine-grained video representation learning and embed an extra Pearson
constraint to optimize cross-modal representation learning. In the retrieval phase,
we use coarse-grained video representations for fast recall of the top-k
candidates, which are then reranked by fine-grained video representations.
Extensive experiments on four benchmarks demonstrate the efficiency and
effectiveness of our approach. Notably, our method achieves performance comparable to the
current state-of-the-art methods while being nearly 50 times faster.
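To make the two-stage design concrete, here is a minimal PyTorch-style sketch of the ideas in the abstract: a parameter-free, text-gated pooling of frame features as a stand-in for the TIB, a Pearson-correlation term as an auxiliary training constraint, and a retrieval routine that recalls top-k candidates with coarse video embeddings before reranking them with fine-grained, query-conditioned ones. All names and shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def text_gated_pooling(frame_feats, text_feat):
    """Parameter-free text-gated pooling (a sketch of a TIB-like block).

    frame_feats: (num_frames, dim) frame embeddings from the video encoder.
    text_feat:   (dim,) sentence embedding from the text encoder.
    Frames that agree with the query get larger weights, so the pooled vector
    is a query-conditioned, fine-grained video representation.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    gates = torch.softmax(frame_feats @ text_feat, dim=0)   # (num_frames,)
    return (gates.unsqueeze(-1) * frame_feats).sum(dim=0)   # (dim,)

def pearson_constraint(a, b):
    """Auxiliary loss encouraging high Pearson correlation between two
    score vectors (e.g., coarse vs. fine text-video similarities)."""
    a = a - a.mean()
    b = b - b.mean()
    corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return 1.0 - corr   # minimizing pushes the correlation toward 1

def two_stage_retrieve(text_feat, coarse_video_feats, frame_feats_per_video, k=50):
    """Stage 1: fast recall with precomputed coarse video embeddings.
       Stage 2: rerank the k candidates with text-gated fine-grained features."""
    text_norm = F.normalize(text_feat, dim=-1)
    coarse = F.normalize(coarse_video_feats, dim=-1)          # (num_videos, dim)
    coarse_scores = coarse @ text_norm                        # (num_videos,)
    k = min(k, coarse_scores.numel())
    topk = torch.topk(coarse_scores, k).indices

    fine_scores = []
    for idx in topk.tolist():
        fine_feat = text_gated_pooling(frame_feats_per_video[idx], text_feat)
        fine_scores.append(torch.dot(F.normalize(fine_feat, dim=-1), text_norm))
    order = torch.argsort(torch.stack(fine_scores), descending=True)
    return topk[order]   # candidate video indices, best first
```

In practice the coarse video embeddings and per-video frame features would be precomputed offline, so each query costs one text encoding, one matrix-vector product over the corpus, and k lightweight gated-pooling steps; that asymmetry is broadly where the speed advantage of two-stage retrieval comes from.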
Related papers
- Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding [11.211803499867639]
We propose DYTO, a novel dynamic token merging framework for zero-shot video understanding.
DYTO integrates hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences.
Experiments demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods.
arXiv Detail & Related papers (2024-11-21T18:30:11Z)
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understanding of texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video-side modalities and show that we can effectively reduce the number of modalities used at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
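As a rough illustration of the cross-modal generalized distillation idea summarized above (not TeachText's exact formulation; the loss, names, and shapes are assumptions), a student text encoder can be trained to match a similarity matrix built from several pretrained text encoders that are dropped at test time:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(text_emb, video_emb):
    """Cosine similarity between every text and every video in a batch."""
    return F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T

def distillation_loss(student_text_emb, video_emb, teacher_text_embs, tau=0.07):
    """Match the student's text-video similarity distribution to a teacher
    averaged over several extra text encoders (used at training time only).

    student_text_emb:  (B, D) embeddings from the single text encoder kept at test time.
    video_emb:         (B, D) video embeddings in the same joint space.
    teacher_text_embs: list of (B, D) embeddings from additional pretrained text
                       encoders, assumed already projected into the joint space.
    """
    student_sim = similarity_matrix(student_text_emb, video_emb) / tau  # (B, B)
    with torch.no_grad():
        teacher_sim = torch.stack(
            [similarity_matrix(t, video_emb) for t in teacher_text_embs]
        ).mean(dim=0) / tau                                             # (B, B)
    # KL divergence between the row-wise (text -> video) retrieval distributions.
    return F.kl_div(
        F.log_softmax(student_sim, dim=-1),
        F.softmax(teacher_sim, dim=-1),
        reduction="batchmean",
    )
```

Because the extra encoders only shape the training signal, inference still runs a single text encoder, which is consistent with the "no computational overhead at test time" point above.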