Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2602.19040v1
- Date: Tue, 02 Dec 2025 09:52:51 GMT
- Title: Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval
- Authors: Jiaxin Wu, Xiao-Yong Wei, Qing Li
- Abstract summary: We propose an adaptive multi-agent retrieval framework that orchestrates specialized agents over multiple reasoning iterations. Our framework achieves a twofold improvement over CLIP4Clip and significantly outperforms state-of-the-art methods by a large margin.
- Score: 12.701443847087164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of short-form video platforms and the emergence of multimodal large language models (MLLMs) have amplified the need for scalable, effective, zero-shot text-to-video retrieval systems. While recent advances in large-scale pretraining have improved zero-shot cross-modal alignment, existing methods still struggle with query-dependent temporal reasoning, limiting their effectiveness on complex queries involving temporal, logical, or causal relationships. To address these limitations, we propose an adaptive multi-agent retrieval framework that dynamically orchestrates specialized agents over multiple reasoning iterations based on the demands of each query. The framework includes: (1) a retrieval agent for scalable retrieval over large video corpora, (2) a reasoning agent for zero-shot contextual temporal reasoning, and (3) a query reformulation agent for refining ambiguous queries and recovering performance for those that degrade over iterations. These agents are dynamically coordinated by an orchestration agent, which leverages intermediate feedback and reasoning outcomes to guide execution. We also introduce a novel communication mechanism that incorporates retrieval-performance memory and historical reasoning traces to improve coordination and decision-making. Experiments on three TRECVid benchmarks spanning eight years show that our framework achieves a twofold improvement over CLIP4Clip and significantly outperforms state-of-the-art methods by a large margin.
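The coordination pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Memory`, `retrieval_agent`, `reasoning_agent`, `reformulation_agent`, `orchestrate`) is hypothetical, and the agents are toy word-overlap heuristics standing in for the retrieval model and MLLM reasoning the paper uses. It only shows the control flow: retrieve, reason, record the outcome in memory, then either stop or reformulate starting from the best query seen so far.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical filler words that the toy reformulation agent removes.
STOPWORDS = {"a", "an", "the", "of", "some", "kind", "then"}


@dataclass
class Memory:
    """Assumed memory layout: per-iteration (query, confidence) plus text traces."""
    history: List[Tuple[str, float]] = field(default_factory=list)
    traces: List[str] = field(default_factory=list)


def _tokens(text: str) -> set:
    return set(text.lower().split())


def retrieval_agent(query: str, corpus: Dict[str, str], k: int = 2) -> List[str]:
    """Toy retrieval: rank video ids by word overlap between query and caption."""
    q = _tokens(query)
    ranked = sorted(corpus, key=lambda vid: len(q & _tokens(corpus[vid])), reverse=True)
    return ranked[:k]


def reasoning_agent(query: str, ranked: List[str], corpus: Dict[str, str]) -> float:
    """Toy reasoning: fraction of query words covered by the top-ranked caption."""
    q = _tokens(query)
    if not q or not ranked:
        return 0.0
    return len(q & _tokens(corpus[ranked[0]])) / len(q)


def reformulation_agent(query: str) -> str:
    """Toy reformulation: drop filler words; keep the query if nothing remains."""
    kept = [w for w in query.split() if w.lower() not in STOPWORDS]
    return " ".join(kept) if kept else query


def orchestrate(query: str, corpus: Dict[str, str],
                max_iters: int = 3, threshold: float = 0.8):
    """Run agents iteratively, using memory to decide whether to stop or rewrite."""
    memory = Memory()
    best_query, best_ranked, best_conf = query, [], -1.0
    current = query
    for step in range(max_iters):
        ranked = retrieval_agent(current, corpus)
        conf = reasoning_agent(current, ranked, corpus)
        memory.history.append((current, conf))
        memory.traces.append(f"iter {step}: query={current!r} conf={conf:.2f}")
        if conf > best_conf:
            best_query, best_ranked, best_conf = current, ranked, conf
        if conf >= threshold:
            break  # reasoning outcome is good enough; stop early
        # Rewrite from the best query in memory, so a degraded iteration
        # does not propagate into the next one.
        new_query = reformulation_agent(best_query)
        if new_query == current:
            break  # nothing left to refine
        current = new_query
    return best_ranked, memory
```

Calling `orchestrate("some kind of dog catches the frisbee", corpus)` on a small dictionary of video ids and captions returns the best ranked list found and the memory of all iterations. The early-stop threshold and the choice to rewrite from the best remembered query are illustrative design choices, not details taken from the paper.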
Related papers
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
- Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval [99.33724613432922]
We introduce RANKVIDEO, a reasoning-based reranker for video retrieval. RANKVIDEO explicitly reasons over query-video pairs using video content to assess relevance. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework.
arXiv Detail & Related papers (2026-02-02T18:40:37Z)
- When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning [26.489185170468062]
We propose a novel RL framework called Adaptive Complex Query Optimization (ACQO). Our framework is designed to adaptively determine when and how to expand the search process. ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines.
arXiv Detail & Related papers (2026-01-29T03:16:53Z)
- Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion [0.0]
We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries.
arXiv Detail & Related papers (2025-12-15T02:50:43Z)
- Benefits and Limitations of Communication in Multi-Agent Reasoning [11.788489289062312]
We propose a theoretical framework to analyze the expressivity of multi-agent systems. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained.
arXiv Detail & Related papers (2025-10-14T20:04:27Z)
- Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations [70.94563079082751]
E-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. We propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue.
arXiv Detail & Related papers (2025-08-25T15:38:56Z)
- ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z)
- Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking [3.5291730624600848]
Long-form video understanding presents significant challenges for interactive retrieval systems. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking. This paper presents a novel framework to enhance interactive video retrieval through four key innovations.
arXiv Detail & Related papers (2025-04-11T09:36:46Z)
- Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework. At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence. We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z)
- CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to find instances that are semantically related to the query through the interaction of data from different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute the score between queries and candidates. We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.