Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
- URL: http://arxiv.org/abs/2601.23224v1
- Date: Fri, 30 Jan 2026 17:47:30 GMT
- Title: Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
- Authors: Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, Tianxiang Jiang, Ziang Yan, Yi Wang, Hongjie Zhang, Yali Wang, Limin Wang
- Abstract summary: Video-o3 is a novel framework that supports iterative discovery of salient visual clues. Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes.
- Score: 28.87800134659646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.
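The abstract names Task-Decoupled Attention Masking as the fix for attention dispersion across heterogeneous reasoning and tool-calling steps, isolating per-step concentration while preserving shared global context. The paper's exact rule is not given here, so the following is a minimal sketch under one plausible reading: every token carries a segment id (0 for the shared global context, k for the k-th step), and a step's queries may attend to the global context and to earlier tokens of their own step only. The names and the masking rule itself are assumptions.

```python
# Minimal sketch (not the authors' code) of a task-decoupled attention mask.
# Assumption: segment id 0 marks shared global context (video tokens, question);
# id k >= 1 marks tokens of the k-th reasoning/tool-calling step.
import numpy as np

def task_decoupled_mask(segment_ids: list[int]) -> np.ndarray:
    """Boolean [T, T] mask; True means query token i may attend to key token j."""
    seg = np.asarray(segment_ids)
    T = len(seg)
    causal = np.tril(np.ones((T, T), dtype=bool))   # standard causal constraint
    same_step = seg[:, None] == seg[None, :]        # tokens within one step
    to_global = (seg == 0)[None, :]                 # any query -> shared-context keys
    return causal & (same_step | to_global)

# Example: 3 shared-context tokens, then two tool-calling steps of 2 tokens each.
# Step-2 queries (last two rows) see the global prefix and themselves, but not
# step-1 tokens, which is the isolation the abstract describes.
mask = task_decoupled_mask([0, 0, 0, 1, 1, 2, 2])
print(mask.astype(int))
```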
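Similarly hedged, here is one shape a Verifiable Trajectory-Guided Reward that "balances exploration coverage with reasoning efficiency" could take: a verifiable answer-correctness term plus a coverage bonus and a turn-count penalty. The terms, weights, and signature below are illustrative, not the paper's formula.

```python
# Illustrative trajectory-level reward (assumptions throughout, not the paper's).
def trajectory_reward(correct: bool,
                      inspected_spans: list[tuple[float, float]],
                      video_len: float,
                      n_turns: int,
                      max_turns: int = 8,
                      w_cov: float = 0.3,
                      w_eff: float = 0.2) -> float:
    # Verifiable outcome term: does the final answer match the ground truth?
    r_answer = 1.0 if correct else 0.0
    # Coverage term: fraction of the video touched by tool calls
    # (spans assumed non-overlapping in this sketch).
    covered = sum(max(0.0, end - start) for start, end in inspected_spans)
    r_cov = min(1.0, covered / video_len)
    # Efficiency term: penalize long multi-turn interactions (context growth).
    r_eff = 1.0 - n_turns / max_turns
    return r_answer + w_cov * r_cov + w_eff * r_eff

# A correct answer found in 3 turns over two short windows of a 30-min video.
print(trajectory_reward(True, [(10.0, 40.0), (300.0, 330.0)], 1800.0, n_turns=3))
```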
Related papers
- VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning [42.22791607763693]
VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It is trained with Joint Perception Preference and Perception Pretext Reinforcement Learning.
arXiv Detail & Related papers (2026-02-09T16:00:01Z)
- VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos [44.23732277782877]
In long-video understanding, uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. Recent agentic thinking-with-videos paradigms have emerged that adopt a localize-clip-answer pipeline. We propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering.
arXiv Detail & Related papers (2026-02-08T03:45:50Z)
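The VideoTemp-o3 entry above mentions the localize-clip-answer pipeline; as a reference point, here is that pattern reduced to a three-step sketch, with hypothetical `grounder`/`clipper`/`answerer` stand-ins that are not the paper's interfaces.

```python
# Generic localize-clip-answer sketch (all callables are hypothetical).
from typing import Callable

def localize_clip_answer(question: str,
                         video: str,
                         grounder: Callable[[str, str], tuple[float, float]],
                         clipper: Callable[[str, float, float], str],
                         answerer: Callable[[str, str], str]) -> str:
    start, end = grounder(question, video)   # 1) localize a relevant window
    clip = clipper(video, start, end)        # 2) extract / re-sample that clip
    return answerer(question, clip)          # 3) answer from the clip alone
```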
- Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning [54.9540824532312]
We propose Weaver, a novel, end-to-end trainable multimodal agentic reasoning system. We show that Weaver enhances performance on several complex video reasoning benchmarks.
arXiv Detail & Related papers (2026-02-05T16:19:41Z)
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling [87.98096428508181]
LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. We exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
arXiv Detail & Related papers (2025-11-25T19:22:48Z)
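LongVT's cropping tool is described as zooming in on a clip and resampling finer-grained frames. A generic helper (an assumption for illustration, not LongVT's actual tool) makes the idea concrete: the same frame budget spent on a narrower window yields a denser effective sampling rate.

```python
# Crop-and-resample sketch: pick frame indices uniformly inside a time window.
def resample_indices(start_s: float, end_s: float,
                     fps: float, target_frames: int) -> list[int]:
    """Indices of `target_frames` frames spread uniformly over [start_s, end_s]."""
    span = max(end_s - start_s, 1e-6)
    step = span / target_frames
    return [int((start_s + (i + 0.5) * step) * fps) for i in range(target_frames)]

# Coarse pass: 4 frames over a 30-min video; fine pass: 4 frames over a 40 s clip.
print(resample_indices(0, 1800, fps=30, target_frames=4))
print(resample_indices(600, 640, fps=30, target_frames=4))
```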
- Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence [70.2803680525165]
We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning. The model highlights key objects and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2%.
arXiv Detail & Related papers (2025-10-23T14:05:56Z)
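For the explicit spatio-temporal evidence Open-o3 Video attaches to its answers, one can picture a record like the following; the field names and structure are hypothetical, chosen only to illustrate what "timestamps plus bounding boxes alongside the answer" might look like as data.

```python
# Hypothetical schema for grounded answers (not the paper's actual format).
from dataclasses import dataclass

@dataclass
class Evidence:
    timestamp_s: float                             # when the object appears
    box_xyxy: tuple[float, float, float, float]    # where, in pixel coordinates
    label: str                                     # what is highlighted

@dataclass
class GroundedAnswer:
    answer: str
    evidence: list[Evidence]

ga = GroundedAnswer("the chef adds salt",
                    [Evidence(73.2, (120, 40, 260, 210), "salt shaker")])
print(ga)
```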
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video-Large Multimodal Models has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z)
- An Empirical Study on How Video-LLMs Answer Video Questions [41.97630658989303]
Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content.
arXiv Detail & Related papers (2025-08-21T08:42:35Z)
- VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering [14.039561301034848]
Cross-video question answering presents significant challenges beyond traditional single-video understanding. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training.
arXiv Detail & Related papers (2025-08-05T03:33:24Z)
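VideoForest's person-anchored bridging can be illustrated with a toy grouping routine: person tracks from different videos whose (re-ID style) embeddings are close get merged under one identity node before any cross-video reasoning. The greedy matching, threshold, and names are all assumptions.

```python
# Toy person-anchored bridging: group tracks across videos by embedding similarity.
import numpy as np

def bridge_tracks(track_embs: dict[str, np.ndarray], thr: float = 0.8) -> list[set[str]]:
    """Greedily group track ids whose embeddings have cosine similarity >= thr."""
    groups: list[tuple[np.ndarray, set[str]]] = []
    for tid, emb in track_embs.items():
        emb = emb / np.linalg.norm(emb)          # unit-normalize for cosine
        for anchor, members in groups:
            if float(anchor @ emb) >= thr:       # close to an existing identity
                members.add(tid)
                break
        else:                                    # no match: start a new identity
            groups.append((emb, {tid}))
    return [members for _, members in groups]

# Two tracks of the same person in different videos, plus one unrelated track.
rng = np.random.default_rng(0)
p = rng.normal(size=8)
tracks = {"vidA/t1": p + 0.01 * rng.normal(size=8),
          "vidB/t7": p + 0.01 * rng.normal(size=8),
          "vidB/t9": rng.normal(size=8)}
print(bridge_tracks(tracks))
```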
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables a Video-LLM to generate reasoning-rich fine-tuning data from raw videos and improve itself.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
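STEP's self-training loop, as the blurb sketches it, turns raw videos into spatio-temporal graphs and graphs into reasoning-rich fine-tuning data. A high-level outline with assumed placeholder functions (none of them the paper's API):

```python
# One round of graph-guided self-training (illustrative placeholders only).
def self_training_round(videos, model, build_graph, graph_to_qa, keep, finetune):
    synthetic = []
    for v in videos:
        g = build_graph(v)             # spatio-temporal graph: objects, relations, events
        for qa in graph_to_qa(g):      # multi-step QA pairs walked off the graph
            if keep(model, qa):        # filter with the current model, e.g. self-consistency
                synthetic.append(qa)
    return finetune(model, synthetic)  # improved model for the next round
```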
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate the new task of Livestream Highlight Detection, analyze its difficulties, and propose a novel architecture, AntPivot, to solve it.
We construct a fully annotated dataset, AntHighlight, to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network [32.90753137435032]
We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way.
Specifically, the generator employs a fully convolutional sequence network to extract a global representation of a video, and an attention-based network to output normalized importance scores.
The results show the superiority of our proposed method against other state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2021-05-24T07:24:39Z)
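CAAN's generator is said to pair a fully convolutional sequence network with an attention-based network that outputs normalized importance scores. As a toy illustration of the scoring half only (a simplification for this listing, not the paper's architecture), self-attention over per-frame features followed by a softmax gives per-frame importances that sum to one:

```python
# Toy attention-based frame-importance scorer (illustrative, not CAAN itself).
import numpy as np

def frame_importance(feats: np.ndarray) -> np.ndarray:
    """feats: [T, d] per-frame features -> [T] importance scores summing to 1."""
    d = feats.shape[1]
    attn = (feats @ feats.T) / np.sqrt(d)                # self-attention logits
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
    context = attn @ feats                               # attended representation
    logits = (context * feats).sum(axis=1)               # agreement with own frame
    z = np.exp(logits - logits.max())
    return z / z.sum()                                   # normalized importance

scores = frame_importance(np.random.default_rng(1).normal(size=(6, 16)))
print(scores.round(3), scores.sum())
```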