RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2602.02444v2
- Date: Tue, 03 Feb 2026 16:18:39 GMT
- Title: RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
- Authors: Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, Reno Kriz
- Abstract summary: We introduce RANKVIDEO, a reasoning-based reranker for video retrieval. RANKVIDEO explicitly reasons over query-video pairs using video content to assess relevance. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework.
- Score: 99.33724613432922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives while being more efficient.
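The abstract describes a reranking objective that combines pointwise, pairwise, and teacher confidence distillation terms. A minimal sketch of how such terms are commonly combined is shown below; the loss weights `w`, the toy scores, and all function names are illustrative assumptions, not the paper's actual implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pointwise_loss(score, label):
    # Binary cross-entropy on a single query-video relevance score.
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def pairwise_loss(pos_score, neg_score):
    # Encourage the relevant video to outscore the irrelevant one.
    return -math.log(sigmoid(pos_score - neg_score))

def distillation_loss(student_score, teacher_conf):
    # Pull the student's relevance probability toward the teacher's
    # confidence (soft label in [0, 1]).
    p = sigmoid(student_score)
    return -(teacher_conf * math.log(p) + (1 - teacher_conf) * math.log(1 - p))

def combined_loss(pos_score, neg_score, teacher_conf, w=(1.0, 1.0, 1.0)):
    # Hypothetical weighted sum of the three objectives for one
    # (positive, negative) query-video pair.
    return (w[0] * pointwise_loss(pos_score, 1)
            + w[1] * pairwise_loss(pos_score, neg_score)
            + w[2] * distillation_loss(pos_score, teacher_conf))
```

As expected of such a combination, the pairwise term is small when the positive video already outscores the negative one and large when the ranking is inverted.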
Related papers
- Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval [12.701443847087164]
We propose an adaptive multi-agent retrieval framework that orchestrates specialized agents over multiple reasoning iterations. Our framework achieves a twofold improvement over CLIP4Clip and outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-12-02T09:52:51Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z) - Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives [93.31112073070906]
Existing methods rely on large-scale pre-training to improve video retrieval performance. We propose a novel framework to learn fine-grained features for better alignment. We also introduce an inference pipeline to improve performance without additional training.
arXiv Detail & Related papers (2025-08-20T16:03:56Z) - Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric [1.9774761182870912]
We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. We conduct experiments on the YouCook2 benchmark, showing promising retrieval performance.
arXiv Detail & Related papers (2025-04-06T18:18:09Z) - Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval [80.09819072780193]
Average Precision (AP) assesses the overall rankings of relevant videos at the top list.
Recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally.
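Average Precision, the metric this paper optimizes for, summarizes how highly the relevant videos sit in the ranked list. A standard computation from a binary relevance list (my illustration, not code from the paper) is:

```python
def average_precision(relevance):
    # relevance: binary list in ranked order (1 = relevant, 0 = not).
    # AP averages precision@i over the positions i of relevant items.
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0
```

Because precision is evaluated only at relevant positions, swapping a relevant video with an irrelevant one higher in the list raises AP, which is why the ranking of pairs near the top matters more than pairs deep in the list.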
arXiv Detail & Related papers (2024-07-22T11:52:04Z) - Group-aware Contrastive Regression for Action Quality Assessment [85.43203180953076]
We show that the relations among videos can provide important clues for more accurate action quality assessment.
Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.
arXiv Detail & Related papers (2021-08-17T17:59:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.