PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
- URL: http://arxiv.org/abs/2601.13797v1
- Date: Tue, 20 Jan 2026 09:57:04 GMT
- Title: PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
- Authors: Gabriele Serussi, David Vainshtein, Jonathan Kouchly, Dotan Di Castro, Chaim Baskin,
- Abstract summary: Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs). We introduce PREGEN, an efficient and powerful CoVR framework that overcomes these limitations.
- Score: 9.493866391853723
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
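The extraction pipeline described in the abstract (frozen VLM, last-token hidden state from each layer, lightweight trainable encoder over the pooled per-layer states) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the tiny transformer below stands in for the frozen pre-trained VLM, and all module names, layer counts, and dimensions are invented for the example.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a frozen pre-trained VLM: a small transformer stack."""
    def __init__(self, d_model=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        for p in self.parameters():       # frozen: no VLM fine-tuning
            p.requires_grad_(False)

    @torch.no_grad()
    def last_token_states(self, tokens):
        """Collect the final token's hidden state at every layer: (B, L, D)."""
        states, h = [], tokens
        for layer in self.layers:
            h = layer(h)
            states.append(h[:, -1, :])    # hidden state of the last token
        return torch.stack(states, dim=1)

class LightweightEncoder(nn.Module):
    """The only trainable part: maps per-layer states to one retrieval embedding."""
    def __init__(self, d_model=64, n_layers=4, d_embed=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Flatten(),                 # (B, L, D) -> (B, L*D)
            nn.Linear(n_layers * d_model, d_embed),
        )

    def forward(self, layer_states):
        z = self.proj(layer_states)
        return nn.functional.normalize(z, dim=-1)  # unit norm for cosine retrieval

backbone, encoder = FrozenBackbone(), LightweightEncoder()
fused_tokens = torch.randn(2, 10, 64)     # fake (query video + text) token sequence
emb = encoder(backbone.last_token_states(fused_tokens))
print(emb.shape)                          # torch.Size([2, 32])
```

In this sketch only `LightweightEncoder` has trainable parameters, which mirrors the paper's claim of avoiding any VLM fine-tuning; how the real method tokenizes the video and pools states is not specified here and should be taken from the paper itself.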
Related papers
- CoPE-VideoLM: Codec Primitives For Efficient Video Language Models [56.76440182038839]
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. Current methods use sampling, which can miss both macro-level events and micro-level details due to its sparse temporal coverage. We propose to leverage video primitives, which encode video redundancy and sparsity without requiring expensive full-image encoding for most frames.
arXiv Detail & Related papers (2026-02-13T18:57:31Z)
- X-Aligner: Composed Visual Retrieval without the Bells and Whistles [5.3880484326593745]
We propose a novel Composed Video Retrieval (CoVR) framework that leverages the representational power of Vision-Language Models (VLMs). The framework incorporates X-Aligner, a novel module whose cross-attention layers progressively fuse visual and textual inputs. It achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
arXiv Detail & Related papers (2026-01-23T09:33:38Z)
- Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval [9.243219818283263]
Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features. We introduce HVP-Net, a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder.
arXiv Detail & Related papers (2026-01-19T06:55:33Z)
- From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos [48.666667545084835]
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving.
arXiv Detail & Related papers (2025-06-05T17:31:17Z)
- LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts [19.81035705650859]
We introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset.
arXiv Detail & Related papers (2025-05-20T04:49:09Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration [73.70209718408641]
SeedVR is a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. It achieves highly competitive performance on both synthetic and real-world benchmarks, as well as on AI-generated videos.
arXiv Detail & Related papers (2025-01-02T16:19:48Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to exploring multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
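The MOV recipe above, reusing one pre-trained vision encoder with minimal modification across several image-like modalities, can be sketched as follows. The tiny convolutional encoder is only a stand-in for the real pre-trained vision tower (e.g. CLIP's); all names, sizes, and input shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedVisionEncoder(nn.Module):
    """One encoder reused across modalities; stands in for a pre-trained vision tower."""
    def __init__(self, d_embed=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global spatial pooling
            nn.Flatten(),              # (B, 8, 1, 1) -> (B, 8)
            nn.Linear(8, d_embed),
        )

    def forward(self, x):
        return self.net(x)

enc = SharedVisionEncoder()
rgb  = torch.randn(2, 3, 32, 32)   # video frame
flow = torch.randn(2, 3, 32, 32)   # optical flow rendered as a 3-channel image
spec = torch.randn(2, 3, 32, 32)   # audio spectrogram tiled to 3 channels
# The same weights encode every modality; only how each input is rendered
# as an "image" differs, which is the essence of the MOV idea.
feats = [enc(x) for x in (rgb, flow, spec)]
print([f.shape for f in feats])
```

How MOV actually renders flow and spectrograms as encoder inputs, and which layers it minimally modifies, is specified in the paper itself and not reproduced here.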
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.