FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning
- URL: http://arxiv.org/abs/2509.24008v3
- Date: Sun, 05 Oct 2025 15:20:25 GMT
- Title: FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning
- Authors: Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai,
- Abstract summary: Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps.
- Score: 65.42201665046505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.
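The abstract describes FiCOT as a multi-turn alternation between textual reasoning and tool-based frame extraction, and DRFS-GRPO as group-relative policy optimization from outcome-based rewards without frame-level annotations. Below is a minimal sketch of what such a rollout loop and group-relative advantage might look like; it is an illustration based only on the abstract, and every interface here (`generate_step`, `FrameRequest`, `video.extract`, etc.) is a hypothetical placeholder, not the authors' actual API.

```python
# Illustrative sketch of a Frame-Interleaved Chain-of-Thought (FiCOT) rollout
# plus a GRPO-style group-relative advantage, inferred from the abstract.
# All names below are hypothetical placeholders.
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class FrameRequest:
    start: float        # requested clip start (seconds)
    end: float          # requested clip end (seconds)
    num_frames: int     # how many frames to extract
    resolution: int     # spatial resolution (DRFS trades frame count vs. resolution)


def ficot_rollout(model, video, question, max_turns=4):
    """Alternate between textual reasoning and active frame extraction."""
    context = [question]
    answer = None
    for _ in range(max_turns):
        step = model.generate_step(context)        # hypothetical model interface
        context.append(step.text)                  # textual reasoning so far
        if step.frame_request is not None:         # model asks for more visual evidence
            req: FrameRequest = step.frame_request
            clip_frames = video.extract(req.start, req.end,
                                        req.num_frames, req.resolution)
            context.append(clip_frames)            # feed extracted frames back in
        if step.is_final_answer:
            answer = step.text
            break
    return answer, context


def group_relative_advantages(rewards):
    """Normalize each outcome reward within its rollout group, so only
    answer-level (outcome-based) rewards are needed, no frame annotations."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

The normalization in `group_relative_advantages` (reward minus group mean, divided by group standard deviation) mirrors standard GRPO practice; whether DRFS-GRPO uses exactly this form is not stated in the abstract.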
Related papers
- VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding [9.415923244280542]
VideoBrain is an end-to-end framework that enables Vision-Language Models to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals.
arXiv Detail & Related papers (2026-02-04T00:08:35Z) - Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs [13.306662159600677]
We introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution adaptation. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP (see the sketch after this list). We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets.
arXiv Detail & Related papers (2025-06-27T11:30:51Z) - ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding [52.050036778325094]
We introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames. Our approach consistently improves reasoning performance across multiple video QA benchmarks.
arXiv Detail & Related papers (2025-06-02T03:08:07Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z) - Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection [51.004020874336284]
VidTFS is a Training-free, open-vocabulary video goal and action inference framework.
Our experiments demonstrate that the proposed frame selection module improves the performance of the framework significantly.
We validate the performance of the proposed VidTFS on four widely used video datasets.
arXiv Detail & Related papers (2024-01-23T03:45:05Z) - Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements [83.5820690348833]
We present a framework for low-level vision tasks that does not require any external training data corpus.
Our approach learns the weights of neural modules by optimizing over the corrupted test sequence, leveraging spatio-temporal coherence and internal statistics of videos.
arXiv Detail & Related papers (2023-12-13T01:57:11Z) - Multi-entity Video Transformers for Fine-Grained Video Representation Learning [34.26732761916984]
We re-examine the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline. Our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
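Several of the papers above (VideoBrain's CLIP-based agent, Q-Frame's training-free selection) score candidate frames against the query with a text-image matching network such as CLIP. A minimal sketch of such query-aware frame scoring is shown below, using the Hugging Face transformers CLIP interface; the helpers `score_frames` and `select_top_k` are illustrative assumptions, not code from any of these papers.

```python
# Query-aware frame scoring with CLIP: rank frames by similarity to the question,
# then keep the top-k in temporal order. Frames are expected as PIL images.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def score_frames(question, frames):
    """Return one CLIP similarity score per frame for the given question."""
    inputs = processor(text=[question], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_frames, 1) for a single text query
    return out.logits_per_image.squeeze(-1)


def select_top_k(frames, scores, k=8):
    """Keep the k most query-relevant frames, preserving temporal order."""
    idx = torch.topk(scores, k=min(k, len(frames))).indices.tolist()
    return [frames[i] for i in sorted(idx)]
```

This is the training-free, plug-and-play flavor of selection; the RL-based methods above (ReFoCUS, ViaRL, FrameMind) instead learn the selection policy from reward signals.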
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.