VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
- URL: http://arxiv.org/abs/2602.08099v1
- Date: Sun, 08 Feb 2026 19:39:32 GMT
- Title: VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
- Authors: Issar Tzachor, Dvir Samuel, Rami Ben-Ari
- Abstract summary: This paper focuses on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. We demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training.
- Score: 11.519642157641023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
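To make the zero-shot recipe concrete, here is a minimal sketch (not the authors' released code) of retrieving videos with intermediate-layer MLLM embeddings: hidden states from one pre-trained layer are mean-pooled into a single vector per video and per caption, and queries are ranked by cosine similarity. The layer index, the mean pooling, and the toy tensors standing in for real MLLM hidden states are illustrative assumptions; the paper's calibrated MLLM head and text-based alignment step are omitted.

```python
import torch
import torch.nn.functional as F


def pool_intermediate_layer(hidden_states, layer_idx, attention_mask):
    """Mean-pool token embeddings from one intermediate MLLM layer.

    hidden_states: tuple of [batch, seq_len, dim] tensors, one per layer
    (e.g. what a forward pass with output_hidden_states=True would return).
    """
    h = hidden_states[layer_idx]                      # [B, T, D]
    mask = attention_mask.unsqueeze(-1).to(h.dtype)   # [B, T, 1]
    emb = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return F.normalize(emb, dim=-1)                   # unit norm, so dot product = cosine


@torch.no_grad()
def rank_videos(text_hidden, text_mask, video_hidden, video_mask, layer_idx=20):
    """For each text query, return video indices sorted by similarity."""
    t = pool_intermediate_layer(text_hidden, layer_idx, text_mask)    # [Nt, D]
    v = pool_intermediate_layer(video_hidden, layer_idx, video_mask)  # [Nv, D]
    sim = t @ v.T                                                     # [Nt, Nv] cosine similarities
    return sim.argsort(dim=-1, descending=True)


if __name__ == "__main__":
    # Toy stand-ins for real MLLM hidden states (33 layers, 4 items, 16 tokens, 64 dims).
    layers, n, t, d = 33, 4, 16, 64
    text_h = tuple(torch.randn(n, t, d) for _ in range(layers))
    video_h = tuple(torch.randn(n, t, d) for _ in range(layers))
    mask = torch.ones(n, t)
    print(rank_videos(text_h, mask, video_h, mask, layer_idx=20))
```

In practice the hidden states would come from a real video MLLM forward pass, and the layer index would be swept as in the paper's layer-wise analysis rather than fixed to the placeholder value used here.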
Related papers
- LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation [32.57236582010967]
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation.
arXiv Detail & Related papers (2025-12-18T18:52:18Z)
- An Empirical Study for Representations of Videos in Video Question Answering via MLLMs [4.726627693005334]
Multimodal large language models (MLLMs) have recently achieved remarkable progress in video question answering. It remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency.
arXiv Detail & Related papers (2025-10-14T09:02:22Z)
- A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advances in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- Prompts to Summaries: Zero-Shot Language-Guided Video Summarization [12.200609701777907]
We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer. It converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging. Our pipeline generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods.
arXiv Detail & Related papers (2025-06-12T15:23:11Z)
- Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs). Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (MLLM). Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
- CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval [24.203328970223527]
We present CaReBench, a testing benchmark for fine-grained video captioning and retrieval. Uniquely, it provides manually separated spatial annotations and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks.
arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- Video Understanding with Large Language Models: A Survey [107.7736911322462]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.