VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
- URL: http://arxiv.org/abs/2510.06040v1
- Date: Tue, 07 Oct 2025 15:34:46 GMT
- Title: VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
- Authors: Xinye Cao, Hongcan Guo, Jiawen Qian, Guoshun Nan, Chao Wang, Yuqi Pan, Tianhao Hou, Xiaojuan Wang, Yutong Gao,
- Abstract summary: VideoMiner learns to understand hour-long videos with multi-modal large language models (MM-LLMs). To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization reinforcement learning method. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization reinforcement learning method that guides the exploration of VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree-growth auxin dynamically adjusts the expansion depth, yielding gains in both accuracy and efficiency. The code is publicly available at https://github.com/caoxinye/VideoMiner.
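The abstract describes the pipeline only at a high level. As a rough illustration, here is a minimal, self-contained sketch of an iterative segment-caption-expand tree construction in the spirit of VideoMiner; all helper names, the leaf/chunk sizes, and the `auxin` threshold value are illustrative assumptions, not the authors' implementation (caption-based clustering of redundant events is also omitted for brevity):

```python
from dataclasses import dataclass, field
from typing import Callable, List
import random

@dataclass
class EventNode:
    frames: List[int]                     # frame indices this node covers
    caption: str = ""                     # event-level caption (stub)
    children: List["EventNode"] = field(default_factory=list)

def segment(frames: List[int], n_parts: int = 4) -> List[List[int]]:
    """Stand-in temporal segmentation: split frames into equal chunks.
    A real system would segment on shot/event boundaries."""
    k = max(1, len(frames) // n_parts)
    return [frames[i:i + k] for i in range(0, len(frames), k)]

def grow_tree(frames: List[int], question: str,
              score: Callable[[str, str], float],
              depth: int = 0, max_depth: int = 4,
              auxin: float = 0.5) -> EventNode:
    """Expand a long video into an event tree, keeping only events the
    policy deems relevant to the question. `score` stands in for the
    T-GRPO-trained policy; `auxin` mimics the paper's tree-growth auxin,
    a threshold that controls expansion depth (its value here is a guess)."""
    node = EventNode(frames=frames, caption=f"clip {frames[0]}-{frames[-1]}")
    if depth >= max_depth or len(frames) <= 8:
        return node                       # leaf: candidate key frames
    for ev in segment(frames):            # long video -> events -> frames
        ev_caption = f"clip {ev[0]}-{ev[-1]}"      # captioning stub
        if score(ev_caption, question) >= auxin:   # expand relevant events only
            node.children.append(
                grow_tree(ev, question, score, depth + 1, max_depth, auxin))
    return node

# Toy usage with a random policy standing in for the trained one.
tree = grow_tree(list(range(1024)), "who opens the door?",
                 score=lambda cap, q: random.random())
```

T-GRPO is named as a tree-structured variant of group relative policy optimization (GRPO). In standard GRPO, from which T-GRPO presumably inherits its advantage estimate, each sampled reward is normalized against its group of G rollouts rather than against a learned value function:

\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}

How VideoMiner forms these groups over sibling events in the tree and injects question guidance is specified in the paper, not reproduced here.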
Related papers
- VideoLucy: Deep Memory Backtracking for Long Video Understanding [102.37736560263649]
We propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks.
arXiv Detail & Related papers (2025-10-14T11:59:19Z) - Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [79.10678768386752]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z) - Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding [33.58579390725519]
Video-MTR is a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns. To ensure the quality of the intermediate reasoning process, we introduce a novel gated bi-level reward system.
arXiv Detail & Related papers (2025-08-28T06:55:08Z) - Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form videos due to context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human episodic memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z) - Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [39.6349428129868]
Multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning.
arXiv Detail & Related papers (2025-08-06T13:03:21Z) - Infinite Video Understanding [50.78256932424239]
We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia research community. We outline the core challenges and key research directions towards achieving this transformative capability.
arXiv Detail & Related papers (2025-07-11T23:07:04Z) - SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos [67.78336281317347]
Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. We propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos.
arXiv Detail & Related papers (2024-05-29T15:49:09Z)