Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval
- URL: http://arxiv.org/abs/2510.21806v1
- Date: Tue, 21 Oct 2025 08:48:59 GMT
- Title: Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval
- Authors: Jiaao Yu, Mingjie Han, Tao Gong, Jian Zhang, Man Lan
- Abstract summary: FDA-CLIP (Frame Difference Alpha-CLIP) is a concise CLIP-based training framework for text-video alignment. The method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional alpha channel. Experiments demonstrate that frame-difference-guided video semantic encoding can effectively balance retrieval efficiency and accuracy.
- Score: 12.972857287063064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid growth of video data, text-video retrieval technology has become increasingly important in numerous application scenarios such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they heavily rely on large-scale annotated video-text pairs, leading to high data acquisition costs; second, there is a significant modality gap between video and text features, which limits cross-modal alignment accuracy. With the development of vision-language models, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally lack enhancement for dynamic video features and fail to effectively suppress static redundant features. To address these issues, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional alpha channel. This proactively guides the model to focus on semantically critical dynamic regions while suppressing static background redundancy. Experiments demonstrate that frame-difference-guided video semantic encoding can effectively balance retrieval efficiency and accuracy.
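The core mechanism described in the abstract (frame differencing to produce a dynamic-region mask that can serve as an alpha channel) can be illustrated with a minimal sketch. The thresholding, padding, and box-blur details below are illustrative assumptions, not the paper's actual pipeline, and the Alpha-CLIP interface is only mimicked by the mask's shape:

```python
import numpy as np

def frame_difference_mask(frames, threshold=0.1, blur_passes=2):
    """Build a per-frame dynamic-region mask from absolute frame differences.

    frames: float array of shape (T, H, W, 3) with values in [0, 1].
    Returns binary masks of shape (T, H, W), usable as an alpha channel.
    """
    gray = frames.mean(axis=-1)                      # (T, H, W) luminance proxy
    diff = np.abs(np.diff(gray, axis=0))             # (T-1, H, W) motion energy
    diff = np.concatenate([diff[:1], diff], axis=0)  # pad so every frame has a mask
    # crude box blur to spread motion energy onto neighbouring pixels
    for _ in range(blur_passes):
        diff = (diff
                + np.roll(diff, 1, axis=1) + np.roll(diff, -1, axis=1)
                + np.roll(diff, 1, axis=2) + np.roll(diff, -1, axis=2)) / 5.0
    return (diff > threshold).astype(np.float32)

# toy clip: static black background with one white square moving rightwards
T, H, W = 4, 32, 32
frames = np.zeros((T, H, W, 3), dtype=np.float32)
for t in range(T):
    frames[t, 8:16, 4 + 4 * t:12 + 4 * t, :] = 1.0

masks = frame_difference_mask(frames)
print(masks.shape)                       # (4, 32, 32)
print(masks[1].sum() > 0)                # True: motion around the moving square
print(masks[:, 26:, 26:].sum() == 0.0)   # True: static corner stays suppressed
```

In the paper's setup, each mask would accompany the RGB frame as a fourth channel of the Alpha-CLIP input, so static regions receive low alpha and contribute less to the video embedding.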
Related papers
- Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video. We validate FRESCO on two zero-shot tasks: video-to-video translation and text-guided video editing.
arXiv Detail & Related papers (2025-12-03T15:51:11Z)
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception.
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration [24.337139909108117]
Excessive frames can paradoxically degrade performance due to context dilution, a "less is more" phenomenon. Videos also exhibit significant temporal redundancy, which we term "visual echoes". AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. Our full approach reduces required frames by up to 86.9% and total input tokens by up to 83.2%.
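The echo-merging idea can be sketched with a much simpler stand-in for AFP's adaptive hierarchical clustering: scan frame features (e.g. fused ResNet-50 + CLIP embeddings) in temporal order and keep a frame only if it differs enough from the last kept one. The greedy rule and the similarity threshold below are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def prune_redundant_frames(features, sim_threshold=0.95):
    """Greedily drop near-duplicate frames ("visual echoes") by cosine similarity.

    features: array of shape (T, D); returns indices of kept representatives.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = [0]
    for t in range(1, len(feats)):
        # keep frame t only if it is dissimilar to the last kept representative
        if feats[t] @ feats[kept[-1]] < sim_threshold:
            kept.append(t)
    return kept

# toy example: frames 0-2 are near-identical, frames 3-5 are near-identical
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=64), rng.normal(size=64)
features = np.stack([base_a + 0.01 * rng.normal(size=64) for _ in range(3)]
                    + [base_b + 0.01 * rng.normal(size=64) for _ in range(3)])

kept = prune_redundant_frames(features)
print(kept)  # [0, 3]: one representative per "echo" cluster
```

A real pipeline would then feed only the kept frames to the video-QA model, which is where the reported frame and token savings come from.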
arXiv Detail & Related papers (2025-08-05T11:31:55Z)
- FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding [17.71123451197036]
The complexity of video data and contextual processing limitations still hinder long-video comprehension. We propose FiLA-Video, a novel framework that integrates multiple frames into a single representation. FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
arXiv Detail & Related papers (2025-04-29T03:09:46Z)
- VTD-CLIP: Video-to-Text Discretization via Prompting CLIP [44.51452778561945]
Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches rely primarily on parameter-efficient fine-tuning of image-text pre-trained models. We propose a video-to-text discretization framework to address limited interpretability and poor generalization due to inadequate temporal modeling.
arXiv Detail & Related papers (2025-03-24T07:27:19Z)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
- RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter [77.0205013713008]
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries.
To date, most state-of-the-art TVR methods perform image-to-video transfer learning based on large-scale pre-trained vision models.
We propose a sparse-and-correlated AdaPter (RAP) to fine-tune the pre-trained model with a few parameterized layers.
arXiv Detail & Related papers (2024-05-29T19:23:53Z)
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection.
In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.