FuguReport

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

Authors Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie
Affiliations Harbin Institute of Technology / Shandong University
Categories Method / Video Retrieval / Reasoning-guided recalling and reranking, Task / Zero-Shot Retrieval / Zero-shot composed video search, Evaluation / Model Evaluation / Effectiveness demonstration
License CC BY 4.0

Abstract Overview

This paper studies zero-shot composed video retrieval, where a system must find a target video from a gallery given a source video and a textual edit instruction. The authors argue that standard embedding retrieval is efficient but can miss implicit consequences of the edit, while exhaustive pairwise reranking is too expensive for large galleries. They propose R^3, an inference-time pipeline that first generates a target-oriented reasoning trace, then uses both the original and reasoning-augmented queries for retrieval, and finally reranks only the recalled candidates. The system is built from frozen Qwen3-VL components and is designed as a coarse-to-fine retrieval program rather than a task-specific trained model.
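The coarse-to-fine control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name coarse_to_fine_search, the three callables, and the toy captions that stand in for videos are all hypothetical, and in the actual system the frozen Qwen3-VL components would take the place of the word-overlap scorers used here.

```python
from typing import Callable, List


def coarse_to_fine_search(
    source: str,
    edit: str,
    gallery: List[str],
    reason: Callable[[str, str], str],
    recall_score: Callable[[str, str, str, str], float],
    rerank_score: Callable[[str, str, str], float],
    recall_k: int,
) -> List[str]:
    """Rank the gallery: reason once, recall cheaply, rerank only the shortlist."""
    # Stage 1: one target-oriented reasoning trace per query.
    trace = reason(source, edit)
    # Stage 2: cheap scoring of every candidate using the query and the trace.
    recalled = sorted(
        gallery,
        key=lambda cand: recall_score(source, edit, trace, cand),
        reverse=True,
    )
    shortlist, tail = recalled[:recall_k], recalled[recall_k:]
    # Stage 3: the expensive pairwise scorer only ever sees the shortlist.
    reranked = sorted(
        shortlist,
        key=lambda cand: rerank_score(source, edit, cand),
        reverse=True,
    )
    return reranked + tail


def word_overlap(a: str, b: str) -> int:
    """Toy similarity: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


# Hypothetical toy data: captions play the role of gallery videos.
toy_gallery = [
    "dog runs on beach",
    "cat sleeps on sofa",
    "dog sleeps on sofa",
    "bird flies over sea",
]
ranking = coarse_to_fine_search(
    source="dog runs on beach",
    edit="make it sleep at home",
    gallery=toy_gallery,
    reason=lambda s, e: "an animal sleeps on a sofa indoors",
    recall_score=lambda s, e, t, c: word_overlap(e, c) + word_overlap(t, c),
    rerank_score=lambda s, e, c: word_overlap(s, c),
    recall_k=2,
)
print(ranking)
```

In this toy run the recall stage cannot separate the two sofa captions, and the reranker then promotes the one that keeps the source subject. The cost argument is visible in the structure: the pairwise scorer is called recall_k times per query instead of once per gallery item.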

Novelty

The distinctive idea is to place a generated reasoning trace before retrieval and use it as a controlled query-expansion signal rather than only as an explanation. The paper also introduces an agreement-gated residual fusion rule so that reasoning can influence retrieval when aligned with the base query, without overriding the original source-edit condition.
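This summary does not state the exact fusion formula, so the following is only one plausible reading of "agreement-gated residual fusion", written as a sketch under stated assumptions. The assumptions are: agreement is measured as cosine similarity between the base and reasoning-augmented query embeddings, the gate is a hard threshold, and the reasoning score enters as a weighted residual on top of the base score. The names gated_residual_fusion, threshold, and weight are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def gated_residual_fusion(
    base_scores: Sequence[float],
    reasoning_scores: Sequence[float],
    agreement: float,
    threshold: float = 0.5,  # assumed value, not from the paper
    weight: float = 0.3,     # assumed value, not from the paper
) -> List[float]:
    """Add a weighted residual from the reasoning scores only when the gate is open."""
    gate = 1.0 if agreement >= threshold else 0.0
    return [
        b + weight * gate * (r - b)
        for b, r in zip(base_scores, reasoning_scores)
    ]


# Toy per-candidate scores (made up for illustration).
base_scores = [0.8, 0.6]
reasoning_scores = [0.4, 1.0]

# Reasoning query roughly aligned with the base query: the gate opens.
aligned = cosine([1.0, 0.0], [0.8, 0.6])
fused_open = gated_residual_fusion(base_scores, reasoning_scores, aligned)

# Reasoning query orthogonal to the base query: the gate stays closed.
off_topic = cosine([1.0, 0.0], [0.0, 1.0])
fused_closed = gated_residual_fusion(base_scores, reasoning_scores, off_topic)

print(fused_open, fused_closed)
```

With the gate closed, the fused scores equal the base scores, which matches the stated goal that reasoning should not override the original source-edit condition. With the gate open, the residual can change the order of close candidates while the base score still carries most of the weight.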

Results

The experiments show that reasoning-guided recall gives a modest improvement over the embedding baseline, while reranking contributes the main gain in top-1 accuracy. In the reported official results, the method reaches 95.44 R@1 on validation and 98.82 R@1 on test, with test R@5 through R@50 all at 100.00. The ablation discussion attributes a +0.34 R@1 gain to reasoning and a further +3.70 R@1 gain to reranking.
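For readers unfamiliar with the metric, R@K is the percentage of queries whose target video appears among the top K ranked candidates. The sketch below computes it on made-up toy rankings; the helper name recall_at_k and the data are illustrative only and are unrelated to the paper's evaluation code or results.

```python
from typing import List, Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Percentage of queries whose target is within the top-k ranked items."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


# Toy data: four queries, each with a ranked list of three gallery ids.
toy_rankings: List[List[str]] = [
    ["a", "b", "c"],  # target "a" at rank 1
    ["b", "a", "c"],  # target "a" at rank 2
    ["c", "a", "b"],  # target "b" at rank 3
    ["a", "c", "b"],  # target "c" at rank 2
]
toy_targets = ["a", "a", "b", "c"]

r_at_1 = recall_at_k(toy_rankings, toy_targets, 1)
r_at_2 = recall_at_k(toy_rankings, toy_targets, 2)
r_at_3 = recall_at_k(toy_rankings, toy_targets, 3)
print(r_at_1, r_at_2, r_at_3)
```

Because reranking only reorders the recalled shortlist, it can raise R@1 but cannot raise R@K for any K at or above the shortlist size. This is consistent with the summary's observation that reranking mainly drives top-1 accuracy.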

Key Points

  1. R^3 frames composed video retrieval as a reasoning-guided coarse-to-fine pipeline with separate reasoning, recall, and reranking stages.
  2. The method uses frozen Qwen3-VL models, generating a target-side reasoning paragraph and combining base and reasoning-augmented retrieval scores through agreement-gated residual fusion.
  3. Empirically, reasoning offers a small recall benefit, whereas pairwise reranking is the main driver of improved top-ranked retrieval performance.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.