Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
- URL: http://arxiv.org/abs/2511.01617v1
- Date: Mon, 03 Nov 2025 14:25:12 GMT
- Title: Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
- Authors: Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
- Abstract summary: Vote-in-Context (ViC) is a training-free framework that rethinks list-wise reranking and fusion as a zero-shot reasoning task. ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX.
- Score: 3.9266376632068485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding the candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that rethinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles, to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines such as CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
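The contrast the abstract draws is easy to picture in code. Below is a minimal Python sketch (not the authors' implementation; the function names and toy data are illustrative) of the two ideas: a classic score-only fusion baseline, CombSUM, which sums each candidate's normalized scores across retrievers, versus serializing candidate evidence plus retriever votes into a single prompt for a VLM to rerank. In the paper, the content evidence for each video is an S-Grid image with optional subtitles; plain text stands in for it here.

```python
# Hedged sketch: CombSUM fusion vs. ViC-style prompt serialization.
# All names (fuse_combsum, build_vic_prompt) and data are illustrative.

from collections import defaultdict

def fuse_combsum(ranked_lists):
    """CombSUM: sum each candidate's min-max-normalized scores across retrievers."""
    fused = defaultdict(float)
    for scores in ranked_lists:  # scores: dict candidate_id -> raw score
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for cand, s in scores.items():
            fused[cand] += (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)

def build_vic_prompt(query, candidates, votes):
    """Serialize content evidence + retriever votes into one reranking prompt.

    candidates: dict candidate_id -> textual evidence (the paper attaches an
    S-Grid image per video; text stands in here).
    votes: dict candidate_id -> list of (retriever_name, rank) pairs.
    """
    lines = [f"Query: {query}", "Candidates:"]
    for cid, evidence in candidates.items():
        vote_str = ", ".join(f"{r} ranked it #{k}" for r, k in votes.get(cid, []))
        lines.append(f"[{cid}] evidence: {evidence} | retriever votes: {vote_str}")
    lines.append("Rank the candidate ids from most to least relevant to the query.")
    return "\n".join(lines)

if __name__ == "__main__":
    lists = [{"v1": 0.9, "v2": 0.4, "v3": 0.1}, {"v2": 0.8, "v1": 0.5, "v3": 0.3}]
    print("CombSUM order:", fuse_combsum(lists))  # ['v1', 'v2', 'v3']
    print(build_vic_prompt(
        "a dog catching a frisbee",
        {"v1": "park scene, dog jumps", "v2": "cooking tutorial"},
        {"v1": [("clip-ret", 1), ("blip-ret", 2)], "v2": [("clip-ret", 2)]},
    ))
```

The key design point, per the abstract, is that the prompt carries both signals at once, so the model can weigh retriever consensus against content rather than relying on scores alone as CombSUM does.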
Related papers
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
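For context on the GRPO component that Pretext-GRPO builds on: a minimal sketch of the group-relative advantage such methods compute. This is the generic formulation, not the ViSS-R1 code, and the reward values are invented.

```python
# Hedged sketch of the GRPO-style group-relative advantage (generic, assumed).

def grpo_advantages(rewards, eps=1e-8):
    """For a group of sampled responses to one prompt, score each response
    relative to the group: advantage_i = (r_i - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

if __name__ == "__main__":
    # e.g., rewards from checking pretext-task answers (correct=1, wrong=0)
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[1.0, -1.0, 1.0, -1.0]
```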
- Beyond Simple Edits: Composed Video Retrieval with Dense Modifications [96.46069692338645]
We introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples with dense modification text, around seven times more than its existing counterpart. We develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion.
arXiv Detail & Related papers (2025-08-19T17:59:39Z)
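The abstract names cross-attention fusion as the integration mechanism. A minimal single-head sketch of that pattern follows; the shapes and the no-projection simplification are my own assumptions, not the paper's architecture.

```python
# Hedged sketch of cross-attention fusion: text tokens attend over visual tokens.

import numpy as np

def cross_attention_fuse(text_feats, vis_feats):
    """text_feats: (T, d), vis_feats: (V, d); returns fused (T, d) features."""
    d = text_feats.shape[-1]
    scores = text_feats @ vis_feats.T / np.sqrt(d)        # (T, V) similarities
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)              # row-wise softmax
    attended = attn @ vis_feats                           # visual evidence per text token
    return text_feats + attended                          # residual fusion

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = cross_attention_fuse(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
    print(fused.shape)  # (4, 8)
```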
- Prompts to Summaries: Zero-Shot Language-Guided Video Summarization
We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer. It converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging. Our pipeline generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods.
arXiv Detail & Related papers (2025-06-12T15:23:11Z)
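The caption-then-judge pipeline the abstract outlines can be sketched compactly: a VidLM captions scenes, an LLM scores each caption against the user query, and the top-scoring scenes form the skim. `caption_scene` and `llm_score` below are hypothetical stand-ins for real model calls, not the paper's API.

```python
# Hedged sketch of a zero-shot caption-then-judge summarizer (names assumed).

def summarize(scenes, query, caption_scene, llm_score, budget=3):
    """scenes: list of video segments; returns the `budget` most relevant ones."""
    captions = [caption_scene(s) for s in scenes]             # VidLM captioning pass
    scored = [(llm_score(query, c), i) for i, c in enumerate(captions)]
    keep = sorted(scored, reverse=True)[:budget]              # LLM-judged relevance
    return [scenes[i] for _, i in sorted(keep, key=lambda t: t[1])]  # restore time order

if __name__ == "__main__":
    scenes = ["clip_a", "clip_b", "clip_c", "clip_d"]
    captions = {"clip_a": "goal scored", "clip_b": "crowd", "clip_c": "replay", "clip_d": "ads"}
    print(summarize(scenes, "soccer goals",
                    caption_scene=captions.get,
                    llm_score=lambda q, c: float("goal" in c or "replay" in c),
                    budget=2))  # ['clip_a', 'clip_c']
```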
- Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
We propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. Our approach is simple and self-contained and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. This, in turn, leads to improved performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2025-05-31T00:08:21Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in Recall@1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
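The abstract's focus-then-fuse localizer uses modality-specific gates; a minimal sketch of that gating pattern follows. The sigmoid gate form, the two-modality setup, and the fixed random parameters are my assumptions, for illustration only.

```python
# Hedged sketch of modality-specific gating before fusion (form assumed).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(video_feat, subtitle_feat, w_v, w_s):
    """Scale each modality by its own gate, then sum.

    video_feat, subtitle_feat: (d,) clip features from the two modalities.
    w_v, w_s: (d,) gate parameters (learned in a real model).
    """
    g_v = sigmoid(w_v * video_feat)       # per-dimension gate for vision
    g_s = sigmoid(w_s * subtitle_feat)    # per-dimension gate for subtitles
    return g_v * video_feat + g_s * subtitle_feat

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d = 8
    fused = gated_fuse(*(rng.normal(size=d) for _ in range(4)))
    print(fused.shape)  # (8,)
```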
- Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a queried natural language expression.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We instead tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z)
- Multi-query Video Retrieval [44.32936301162444]
We focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.
We propose several new methods for leveraging multiple queries at training time to improve over simply combining similarity outputs of multiple queries.
We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications.
arXiv Detail & Related papers (2022-01-10T20:44:46Z)
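The inference-time baseline this entry improves on, combining the similarity outputs of multiple queries, reduces to simple late fusion. A minimal sketch under that reading, with invented similarity values:

```python
# Hedged sketch: combine per-query similarities by averaging (baseline reading).

import numpy as np

def multi_query_rank(sim_matrix):
    """sim_matrix: (num_queries, num_videos) similarities for one query set;
    returns video indices ranked by mean similarity across the queries."""
    combined = sim_matrix.mean(axis=0)           # late fusion across queries
    return np.argsort(-combined)

if __name__ == "__main__":
    sims = np.array([[0.2, 0.9, 0.4],            # query 1 vs. 3 videos
                     [0.3, 0.7, 0.8]])           # query 2 vs. the same videos
    print(multi_query_rank(sims))                # [1 2 0]
```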
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
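The three annotation types described above imply a simple per-video record. A minimal sketch follows; the field names are my own, not the dataset's actual schema.

```python
# Hedged sketch of one QVHighlights-style annotation record (field names assumed).

from dataclasses import dataclass, field

@dataclass
class QVHighlightsExample:
    video_id: str
    query: str                                   # human-written free-form NL query
    moments: list[tuple[float, float]]           # (start_sec, end_sec) relevant spans
    clip_saliency: dict[int, int] = field(default_factory=dict)  # clip idx -> 1..5

example = QVHighlightsExample(
    video_id="abc123",
    query="a chef flips a pancake",
    moments=[(12.0, 18.0), (40.0, 44.0)],
    clip_saliency={6: 4, 7: 5, 20: 3},
)
print(example.query, example.moments)
```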
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.