Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
- URL: http://arxiv.org/abs/2512.14273v1
- Date: Tue, 16 Dec 2025 10:34:39 GMT
- Title: Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
- Authors: Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
- Abstract summary: Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question. We present Zoom-Zero, a framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification.
- Score: 80.03914556721519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
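The coarse-to-fine zoom-in described in the abstract can be illustrated with a small frame-sampling sketch. This is not the authors' implementation: the function names (`uniform_timestamps`, `zoom_in_timestamps`), the bin-center sampling rule, and the frame budgets are all assumptions chosen for illustration. It only shows the general idea of sampling sparsely over the whole video for global context, then sampling densely inside a predicted segment.

```python
from typing import List, Tuple


def uniform_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Return num_frames timestamps at the centers of equal-width bins in [start, end]."""
    if num_frames <= 0 or end <= start:
        return []
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


def zoom_in_timestamps(
    duration: float,
    segment: Tuple[float, float],
    coarse_frames: int,
    fine_frames: int,
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: coarse pass over the full video, fine pass inside `segment`.

    `segment` stands in for a (start, end) prediction in seconds from a grounding
    model; here it is simply passed in by the caller.
    """
    # Coarse pass: sparse, uniform coverage of the whole video (global context).
    coarse = uniform_timestamps(0.0, duration, coarse_frames)
    # Clamp the predicted segment to the valid range of the video.
    seg_start = max(0.0, min(segment[0], duration))
    seg_end = max(seg_start, min(segment[1], duration))
    # Fine pass: dense sampling restricted to the predicted segment (zoom-in).
    fine = uniform_timestamps(seg_start, seg_end, fine_frames)
    return coarse, fine


coarse, fine = zoom_in_timestamps(
    duration=120.0, segment=(30.0, 40.0), coarse_frames=4, fine_frames=5
)
print(coarse)  # [15.0, 45.0, 75.0, 105.0]
print(fine)    # [31.0, 33.0, 35.0, 37.0, 39.0]
```

In this sketch the fine pass places five frames inside a 10-second window, whereas the coarse pass spaces its frames 30 seconds apart, which is the sense in which zooming in preserves local detail without discarding the global view.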
Related papers
- SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM [36.28285195488772]
Large language models (LLMs) have demonstrated exceptional capabilities in text understanding. Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information. This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding.
arXiv Detail & Related papers (2026-02-03T14:39:16Z)
- Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence [70.2803680525165]
We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning. The model highlights key objects and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mL timestamp by 24.2%.
arXiv Detail & Related papers (2025-10-23T14:05:56Z)
- Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
- Dense Video Understanding with Gated Residual Tokenization [49.17263029080152]
High temporal resolution is essential for capturing fine-grained details in video understanding. Current benchmarks rely mostly on low-frame-rate sampling. Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
arXiv Detail & Related papers (2025-09-17T17:34:40Z)
- ResidualViT for Efficient Temporally Dense Video Encoding [66.57779133786131]
We make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model.
arXiv Detail & Related papers (2025-09-16T17:12:23Z)
- Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning [6.9627404612894335]
Temporal Video Grounding (TVG) requires pinpointing relevant temporal segments from video based on language query. We propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task. Our method accomplishes a notable advantage over SOTA solutions by around 3.5% on the original QVHighlights testbench.
arXiv Detail & Related papers (2025-07-07T06:51:40Z)
- No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)", which results in worse performance with lower SNR.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.