TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
- URL: http://arxiv.org/abs/2511.05489v1
- Date: Fri, 07 Nov 2025 18:58:25 GMT
- Title: TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
- Authors: Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
- Abstract summary: Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query. TimeSearch-R reformulates temporal search as interleaved text-video thinking. It seamlessly integrates searching video clips into the reasoning process through reinforcement learning.
- Score: 30.895558843927862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
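The abstract describes GRPO-CSV as standard group-relative advantage estimation plus a self-verification reward: the policy re-checks whether the frames it searched are sufficient, and that check contributes to the rollout's reward. A minimal sketch of that reward shaping, assuming illustrative reward weights and boolean verification outcomes (the paper's actual reward design and weights are not specified here):

```python
import statistics

def csv_reward(answer_correct: bool, frames_verified_complete: bool,
               w_answer: float = 1.0, w_verify: float = 0.5) -> float:
    """Scalar reward for one rollout: outcome reward plus a
    completeness self-verification bonus (weights are illustrative)."""
    return w_answer * float(answer_correct) + w_verify * float(frames_verified_complete)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: normalize each rollout's
    reward by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Four sampled rollouts for one query:
# (answered correctly?, searched frames self-verified as complete?)
rollouts = [(True, True), (True, False), (False, True), (False, False)]
rewards = [csv_reward(a, v) for a, v in rollouts]
advantages = grpo_advantages(rewards)
```

Under this shaping, a rollout that answers correctly but skips adequate exploration earns less than one that both answers correctly and passes the completeness check, which is the failure mode (unsupervised intermediate search decisions) the method targets.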
Related papers
- MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation [2.819801450768979]
We introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments. A Google Image Search-based fallback module expands query representations with external web imagery.
arXiv Detail & Related papers (2025-12-15T02:25:46Z) - Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning [6.9627404612894335]
Temporal Video Grounding (TVG) requires pinpointing relevant temporal segments from a video based on a language query. We propose Tempo-R0, a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task. Our method achieves a notable advantage over SOTA solutions of around 3.5% on the original QVHighlights testbench.
arXiv Detail & Related papers (2025-07-07T06:51:40Z) - MMSearch-R1: Incentivizing LMMs to Search [49.889749277236376]
We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty.
arXiv Detail & Related papers (2025-06-25T17:59:42Z) - T*: Re-thinking Temporal Search for Long-Form Video Understanding [66.72243342954823]
Current temporal search methods achieve only a 2.1% temporal F1 score on the LongVideoBench subset. Inspired by visual search in images, we propose a lightweight temporal search framework, T*, which reframes costly temporal search as spatial search. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding.
arXiv Detail & Related papers (2025-04-03T04:03:10Z) - TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding [24.52604124233087]
Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM.
arXiv Detail & Related papers (2025-04-02T06:47:19Z) - Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [50.419872452397684]
Search-R1 is an extension of reinforcement learning for reasoning frameworks. It generates search queries during step-by-step reasoning with real-time retrieval. It improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines.
arXiv Detail & Related papers (2025-03-12T16:26:39Z) - Temporal Preference Optimization for Long-Form Video Understanding [63.196246578583136]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z) - A Flexible and Scalable Framework for Video Moment Search [51.47907684209207]
This paper introduces a flexible framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query. Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time.
arXiv Detail & Related papers (2025-01-09T08:54:19Z) - T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval [30.48217069475297]
We introduce a model-based video indexer named T2VIndexer, a sequence-to-sequence generative model that directly generates video identifiers. T2VIndexer aims to reduce retrieval time while maintaining high accuracy.
arXiv Detail & Related papers (2024-08-21T08:40:45Z) - Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)