Fugu-MT Paper Translation (Abstract): Simple o3: Towards Interleaved Vision-Language Reasoning

Paper overview: Simple o3: Towards Interleaved Vision-Language Reasoning

arxiv url: http://arxiv.org/abs/2508.12109v1
Date: Sat, 16 Aug 2025 17:15:39 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.573337
Title: Simple o3: Towards Interleaved Vision-Language Reasoning
Title (reference translation): Simple o3: Aiming for Interleaved Vision-Language Reasoning
Authors: Ye Wang, Qianglong Chen, Zejun Li, Siyuan Wang, Shijie Guo, Zhirui Zhang, Zhongyu Wei
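Abstract summary: We propose Simple o3, an end-to-end framework that integrates dynamic tool interactions into interleaved vision-language reasoning. The approach features a scalable data synthesis pipeline that generates high-quality vision-language reasoning chains. Experimental results show that Simple o3 performs well across diverse benchmarks and outperforms existing approaches.
Reference score (independently computed attention metric): 38.46230601239066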
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI's o3 model, which emulates human-like "thinking with image" through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an "observe-reason-act" cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results demonstrate Simple o3's superior performance on diverse benchmarks, outperforming existing approaches. By combining enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. Remarkably, we provide the first in-depth analysis of different interleaved reasoning strategies, offering insights into their impact on model performance. We found that by introducing additional visual tokens for interleaved vision-language reasoning, reusing and magnifying the original image significantly improves the model's visual reasoning and fine-grained perception, while image cropping based on precise visual grounding allows the model to effectively focus on key entities or regions, further enhancing its capabilities.
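Abstract (reference translation): Multimodal large language models (MLLMs) perform remarkably well on vision-language tasks, but their capacity for long Chain-of-Thought (CoT) reasoning in multimodal scenarios has not yet been fully explored. Taking inspiration from OpenAI's o3 model, which emulates human-like "thinking with image" through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that uses supervised fine-tuning (SFT) to integrate dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning. The approach builds a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains through an "observe-reason-act" cycle, with executable visual operations and rigorous verification, and produces the open-source TWI-Tools-146K dataset. Experiments show that Simple o3 performs well across diverse benchmarks and outperforms existing approaches. By combining enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. Notably, we provide the first in-depth analysis of different interleaved reasoning strategies and offer insight into how they affect model performance. We find that, by introducing additional visual tokens for interleaved vision-language reasoning, reusing and magnifying the original image significantly improves the model's visual reasoning and fine-grained perception, while image cropping based on precise visual grounding lets the model focus effectively on key entities or regions, further enhancing its capabilities.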
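
The tool interactions named in the abstract (cropping, zooming, reusing) and the "observe-reason-act" cycle can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `crop`, `zoom`, `reuse`, and `run_chain`, the nested-list image representation, and the scripted steps that stand in for model-generated reasoning are all assumptions made for illustration. The sketch also assumes every tool is applied to the original image, which the abstract does not specify.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a grayscale image is a list of rows of integer pixel values.
Image = List[List[int]]
# Assumption: one step is (reasoning text, tool name, tool arguments).
Step = Tuple[str, str, tuple]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the region box = (left, top, right, bottom); right/bottom exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Enlarge the image by an integer factor using nearest-neighbour repetition."""
    out: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def reuse(image: Image) -> Image:
    """Feed an unchanged copy of the image back into the reasoning chain."""
    return [list(row) for row in image]


# Hypothetical tool registry; the real tool interface is not given in the abstract.
TOOLS: Dict[str, Callable[..., Image]] = {"crop": crop, "zoom": zoom, "reuse": reuse}


def run_chain(image: Image, steps: List[Step]) -> List[Tuple[str, object]]:
    """Build an interleaved trace: observe (image), reason (text), act (tool output)."""
    trace: List[Tuple[str, object]] = [("image", image)]
    for thought, tool, args in steps:
        trace.append(("text", thought))        # reason
        new_view = TOOLS[tool](image, *args)   # act on the original image
        trace.append(("image", new_view))      # observe the new view next
    return trace


if __name__ == "__main__":
    picture = [[r * 4 + c for c in range(4)] for r in range(4)]
    scripted = [
        ("The object of interest is near the centre.", "crop", ((1, 1, 3, 3),)),
        ("Small details are hard to read; enlarge the image.", "zoom", (2,)),
    ]
    for kind, payload in run_chain(picture, scripted):
        print(kind, payload if kind == "text" else f"{len(payload)} rows")
```

In a trained model the reasoning text and tool calls would be generated rather than scripted; the fixed `scripted` list here only shows how text and image segments alternate in the resulting chain.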

Related papers