PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
- URL: http://arxiv.org/abs/2601.13797v1
- Date: Tue, 20 Jan 2026 09:57:04 GMT
- Title: PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
- Authors: Gabriele Serussi, David Vainshtein, Jonathan Kouchly, Dotan Di Castro, Chaim Baskin,
- Abstract summary: Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs). We introduce PREGEN, an efficient and powerful CoVR framework that overcomes these limitations.
- Score: 9.493866391853723
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
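The extraction pipeline described in the abstract (frozen VLM, last-token hidden state from each layer, lightweight trainable encoder over the pooled per-layer states) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the tiny transformer below stands in for the frozen pre-trained VLM, and all module names, layer counts, and dimensions are invented for the example.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a frozen pre-trained VLM: a small transformer stack."""
    def __init__(self, d_model=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        for p in self.parameters():       # frozen: no VLM fine-tuning
            p.requires_grad_(False)

    @torch.no_grad()
    def last_token_states(self, tokens):
        """Collect the final token's hidden state at every layer: (B, L, D)."""
        states, h = [], tokens
        for layer in self.layers:
            h = layer(h)
            states.append(h[:, -1, :])    # hidden state of the last token
        return torch.stack(states, dim=1)

class LightweightEncoder(nn.Module):
    """The only trainable part: maps per-layer states to one retrieval embedding."""
    def __init__(self, d_model=64, n_layers=4, d_embed=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Flatten(),                 # (B, L, D) -> (B, L*D)
            nn.Linear(n_layers * d_model, d_embed),
        )

    def forward(self, layer_states):
        z = self.proj(layer_states)
        return nn.functional.normalize(z, dim=-1)  # unit norm for cosine retrieval

backbone, encoder = FrozenBackbone(), LightweightEncoder()
fused_tokens = torch.randn(2, 10, 64)     # fake (query video + text) token sequence
emb = encoder(backbone.last_token_states(fused_tokens))
print(emb.shape)                          # torch.Size([2, 32])
```

In this sketch only `LightweightEncoder` has trainable parameters, which mirrors the paper's claim of avoiding any VLM fine-tuning; how the real method tokenizes the video and pools states is not specified here and should be taken from the paper itself.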
Related papers
- CoPE-VideoLM: Codec Primitives For Efficient Video Language Models [56.76440182038839]
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. Current methods use sampling, which can miss both macro-level events and micro-level details due to its sparse temporal coverage. We propose to leverage video primitives, which encode video redundancy and sparsity without requiring expensive full-image encoding for most frames.
arXiv Detail & Related papers (2026-02-13T18:57:31Z)
- X-Aligner: Composed Visual Retrieval without the Bells and Whistles [5.3880484326593745]
We propose a novel Composed Video Retrieval (CoVR) framework that leverages the representational power of Vision-Language Models (VLMs). The framework incorporates X-Aligner, a novel module whose cross-attention layers progressively fuse visual and textual inputs. It achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
arXiv Detail & Related papers (2026-01-23T09:33:38Z)
- Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval [9.243219818283263]
Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features. We introduce HVP-Net, a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder.
arXiv Detail & Related papers (2026-01-19T06:55:33Z)
- From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos [48.666667545084835]
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving.
arXiv Detail & Related papers (2025-06-05T17:31:17Z)
- LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts [19.81035705650859]
We introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset.
arXiv Detail & Related papers (2025-05-20T04:49:09Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration [73.70209718408641]
SeedVR is a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. It achieves highly competitive performance on both synthetic and real-world benchmarks, as well as on AI-generated videos.
arXiv Detail & Related papers (2025-01-02T16:19:48Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to exploring multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
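The MOV recipe above, reusing one pre-trained vision encoder with minimal modification across several image-like modalities, can be sketched as follows. The tiny convolutional encoder is only a stand-in for the real pre-trained vision tower (e.g. CLIP's); all names, sizes, and input shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedVisionEncoder(nn.Module):
    """One encoder reused across modalities; stands in for a pre-trained vision tower."""
    def __init__(self, d_embed=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global spatial pooling
            nn.Flatten(),              # (B, 8, 1, 1) -> (B, 8)
            nn.Linear(8, d_embed),
        )

    def forward(self, x):
        return self.net(x)

enc = SharedVisionEncoder()
rgb  = torch.randn(2, 3, 32, 32)   # video frame
flow = torch.randn(2, 3, 32, 32)   # optical flow rendered as a 3-channel image
spec = torch.randn(2, 3, 32, 32)   # audio spectrogram tiled to 3 channels
# The same weights encode every modality; only how each input is rendered
# as an "image" differs, which is the essence of the MOV idea.
feats = [enc(x) for x in (rgb, flow, spec)]
print([f.shape for f in feats])
```

How MOV actually renders flow and spectrograms as encoder inputs, and which layers it minimally modifies, is specified in the paper itself and not reproduced here.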
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.