Fugu-MT Paper Translation (Summary): Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Paper summary: Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

arxiv url: http://arxiv.org/abs/2510.20579v1
Date: Thu, 23 Oct 2025 14:05:56 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:18.036966
Title: Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
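Title (reference translation): Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Authors: Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
Abstract summary: We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning. The model highlights key objects and bounding boxes alongside its answers, grounding its reasoning in concrete visual observations. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2%.
Reference score (independently computed attention score): 70.2803680525165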
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
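Abstract (reference translation): Most video reasoning models generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, but extending this ability to videos is more challenging. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and we carefully collect training data and design training strategies to address these challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, grounding its reasoning in concrete visual observations. To enable this functionality, we first curate two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations. We then adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video provide useful signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.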
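
Illustrative sketch (not from the paper): the abstract says the rewards jointly encourage answer accuracy, temporal alignment, and spatial precision, but it does not give their definitions. The Python below shows one plausible way such a composite reward could be assembled, assuming temporal alignment is scored by interval IoU, spatial precision by bounding-box IoU, and the three terms are combined with equal weights. The function names, the IoU choice, and the weights are all hypothetical and should not be read as the authors' formulation.

```python
from typing import Tuple

Span = Tuple[float, float]               # (start_sec, end_sec), assumes start <= end
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumes x1 <= x2 and y1 <= y2


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals (assumed alignment measure)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (assumed precision measure)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def composite_reward(
    answer_correct: bool,
    pred_span: Span,
    gt_span: Span,
    pred_box: Box,
    gt_box: Box,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical equal weights
) -> float:
    """Weighted sum of accuracy, temporal, and spatial terms (hypothetical combination)."""
    w_acc, w_time, w_space = weights
    return (
        w_acc * float(answer_correct)
        + w_time * temporal_iou(pred_span, gt_span)
        + w_space * box_iou(pred_box, gt_box)
    )
```

Under these assumptions, a correct answer whose predicted span and box match the annotation exactly scores 3.0, while a correct answer with no temporal or spatial overlap scores 1.0, so the grounding terms reward evidence quality independently of the final answer.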
