Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
- URL: http://arxiv.org/abs/2507.02946v1
- Date: Sat, 28 Jun 2025 15:24:05 GMT
- Title: Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
- Authors: Chenglin Li, Qianglong Chen, Fengtao, Yin Zhang
- Abstract summary: Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the model's focus by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
- Score: 18.027290155746112
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search (TS), a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. TS is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval that is likely to contain task-relevant information. Then, it samples a fixed number of frames from the interval, regardless of its length, and feeds them into the model to produce a refined response and confidence score. TS refines the model's focus by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos. Additionally, keyframe-level descriptions are collected to facilitate cross-interval perception throughout the video. To further improve efficiency, we introduce TS-BFS, a best-first search strategy over a tree. Each node represents a candidate interval and is expanded via two methods: self-driven proposals and uniform partitioning. Nodes are scored based on confidence and self-evaluation, and the most promising one is selected for continued exploration.
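The abstract describes TS and TS-BFS only at a high level. Below is a minimal sketch of the best-first loop under stated assumptions: a hypothetical `answer_fn` callback stands in for the MLLM (it receives sampled frame timestamps plus the question and returns an answer with a confidence score, e.g. a mean answer-token log-probability, which the abstract does not specify), and nodes are expanded only by uniform partitioning. The self-driven proposals, self-evaluation term, and keyframe-level descriptions are omitted.

```python
import heapq
from typing import Callable, List, Tuple

# Hypothetical MLLM callback: given sampled frame timestamps and a question,
# return (answer, confidence). The abstract does not specify the confidence
# measure; mean answer-token log-probability is one common choice.
AnswerFn = Callable[[List[float], str], Tuple[str, float]]


def uniform_timestamps(start: float, end: float, k: int) -> List[float]:
    """Sample k timestamps uniformly inside [start, end], regardless of length."""
    step = (end - start) / k
    return [start + (i + 0.5) * step for i in range(k)]


def ts_bfs(answer_fn: AnswerFn, question: str, duration: float,
           frames_per_node: int = 16, max_nodes: int = 12,
           min_len: float = 10.0) -> Tuple[str, float]:
    """Best-first search over temporal intervals (simplified TS-BFS sketch)."""
    def score(start: float, end: float) -> Tuple[str, float]:
        frames = uniform_timestamps(start, end, frames_per_node)
        return answer_fn(frames, question)

    # Root node covers the whole video.
    best_answer, best_conf = score(0.0, duration)
    heap = [(-best_conf, 0, (0.0, duration))]  # max-heap via negated confidence
    counter, evaluated = 1, 1

    while heap and evaluated < max_nodes:
        _, _, (start, end) = heapq.heappop(heap)
        if end - start <= min_len:
            continue  # interval is already fine-grained enough
        mid = (start + end) / 2.0
        for child in ((start, mid), (mid, end)):  # uniform partitioning
            answer, conf = score(*child)
            evaluated += 1
            if conf > best_conf:
                best_answer, best_conf = answer, conf
            heapq.heappush(heap, (-conf, counter, child))
            counter += 1

    return best_answer, best_conf
```

Because every node is evaluated on a fixed number of frames regardless of interval length, each zoom-in step raises the effective temporal resolution without increasing the frame budget.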
Related papers
- TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [26.463523465270097]
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-based tasks, yet they still face challenges when processing long-duration video inputs. We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and shows transferable ability across different cutting-edge Video-MLLMs.
arXiv Detail & Related papers (2025-08-06T12:03:36Z) - Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning [6.9627404612894335]
Temporal Video Grounding (TVG) requires pinpointing relevant temporal segments in a video based on a language query. We propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task. Our method achieves a notable advantage of around 3.5% over SOTA solutions on the original QVHighlights test benchmark.
arXiv Detail & Related papers (2025-07-07T06:51:40Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding [24.52604124233087]
Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM.
arXiv Detail & Related papers (2025-04-02T06:47:19Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
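The summary states only where the temporal encoder sits, not how it is built. The following is a rough sketch of the idea under assumed components (a generic transformer layer for temporal mixing and average pooling for token reduction); it is not STORM's actual architecture.

```python
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """Rough sketch of a temporal module between an image encoder and the LLM:
    per-frame tokens are mixed across time, then pooled so the LLM sees far
    fewer video tokens. STORM's actual design is not reproduced here.
    """

    def __init__(self, dim: int = 768, heads: int = 8, pool: int = 4):
        super().__init__()
        self.temporal_mixer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, dim), one token per frame from the image encoder.
        mixed = self.temporal_mixer(frame_tokens)                 # temporal context
        return self.pool(mixed.transpose(1, 2)).transpose(1, 2)   # (B, T // pool, dim)


tokens = torch.randn(2, 64, 768)          # 64 frames
print(TemporalEncoder()(tokens).shape)    # torch.Size([2, 16, 768])
```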
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! [22.75945626401567]
We propose a challenging evaluation benchmark named TemporalVQA. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro, reveal significant challenges.
arXiv Detail & Related papers (2025-01-18T06:41:48Z) - TimeRefine: Temporal Grounding with Time Refining Video LLM [75.99665302872901]
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. We reformulate the temporal grounding task as a temporal refining task. We incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth.
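The summary mentions a deviation-sensitive penalty but gives no formula. One hedged way to realize such a penalty, not necessarily the paper's loss, is sketched below.

```python
import torch


def deviation_weighted_loss(pred_segments: torch.Tensor,
                            gt_segments: torch.Tensor) -> torch.Tensor:
    """Sketch of a penalty that grows faster the further a segment drifts.

    pred_segments, gt_segments: (N, 2) tensors of (start, end) in seconds.
    Normalizing by the ground-truth duration and adding a squared term makes
    large deviations disproportionately expensive; the paper's exact
    formulation may differ.
    """
    offsets = (pred_segments - gt_segments).abs()                        # (N, 2)
    durations = (gt_segments[:, 1] - gt_segments[:, 0]).clamp(min=1e-6)  # (N,)
    normalized = offsets / durations.unsqueeze(1)                        # (N, 2)
    return (normalized + normalized.pow(2)).mean()
```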
arXiv Detail & Related papers (2024-12-12T18:59:11Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
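The summary leaves the consistency objective abstract. A hedged sketch of one such equivariance-style penalty, assuming the temporal augmentation is an affine change of the time axis (not necessarily the paper's formulation), is:

```python
import torch
import torch.nn.functional as F


def boundary_consistency_loss(pred_orig: torch.Tensor,
                              pred_aug: torch.Tensor,
                              scale: float, shift: float) -> torch.Tensor:
    """Sketch of an equivariance-style consistency penalty.

    pred_orig, pred_aug: (N, 2) predicted (start, end) boundaries on the
    original and augmented videos. Assuming the augmentation maps original
    time t to scale * t + shift, the augmented prediction mapped back should
    reproduce the original one; their gap is penalized.
    """
    mapped_back = (pred_aug - shift) / scale
    return F.smooth_l1_loss(mapped_back, pred_orig)
```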
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
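The summary does not detail how the TCM computes correlations. A hedged illustration of extracting a tempo signal from a single feature layer via strided self-similarity (an assumption, not the paper's module) is:

```python
import torch
import torch.nn.functional as F


def temporal_self_similarity(features: torch.Tensor,
                             strides=(1, 2, 4)) -> torch.Tensor:
    """Summarize visual tempo from a single layer of frame-level features.

    features: (B, T, C) backbone features, with T larger than max(strides).
    For each stride, the cosine similarity between a frame and its stride-th
    neighbor reflects how quickly appearance changes at that temporal scale.
    The actual TCM is more elaborate; this only illustrates the idea.
    """
    summaries = []
    for s in strides:
        a, b = features[:, :-s], features[:, s:]            # (B, T - s, C)
        sim = F.cosine_similarity(a, b, dim=-1)             # (B, T - s)
        summaries.append(sim.mean(dim=1, keepdim=True))     # (B, 1)
    return torch.cat(summaries, dim=1)                      # (B, len(strides))
```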
arXiv Detail & Related papers (2022-02-24T14:20:04Z)