Fugu-MT Paper Translation (Abstract): Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Paper summary: Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

arxiv url: http://arxiv.org/abs/2604.24191v1
Date: Mon, 27 Apr 2026 08:52:02 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.86533
Title: Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning
Title (reference translation): Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning
Authors: Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang
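Abstract summary: Omni-o3 is a new framework driven by a deep nested deduction policy. The work proposes (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples. Experiments show that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.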
Reference score (independently computed attention score): 46.183380117488014
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
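Abstract (reference translation): Omnimodal understanding involves a massive, highly redundant search space of cross-modal interactions, which demands focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, which leads to isolated reasoning trajectories. Because promising intermediate paths cannot be shared, exploration efficiency is severely limited and errors compound in complex audio-visual tasks. To overcome this bottleneck, we introduce Omni-o3, a new framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches and can iteratively execute four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To strengthen this framework, we propose a two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, which establishes the necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a new multi-step reward model to stimulate deep nested reasoning. Extensive experiments show that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.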
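
The four actions named in the abstract (expansion, selection, simulation, backpropagation) share their names with the phases of classic Monte Carlo tree search. The sketch below is a generic textbook MCTS loop on a toy problem, included only to illustrate how a tree search shares reasoning prefixes across branches. It is not the Omni-o3 policy, which the abstract does not specify. The step alphabet `ACTIONS`, the chain length `DEPTH`, the `TARGET` chain, and the `reward` function are all hypothetical stand-ins.

```python
import math
import random
from dataclasses import dataclass, field

ACTIONS = ("a", "b", "c")       # hypothetical reasoning steps
DEPTH = 4                       # hypothetical maximum chain length
TARGET = ("b", "a", "c", "b")   # hypothetical "correct" chain for the toy reward


def reward(path):
    """Toy reward: fraction of positions that match TARGET."""
    return sum(p == t for p, t in zip(path, TARGET)) / DEPTH


@dataclass
class Node:
    path: tuple = ()            # the reasoning prefix shared by all descendants
    parent: "Node" = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0          # sum of rewards backed up through this node

    def untried(self):
        return [a for a in ACTIONS if a not in self.children]

    def is_terminal(self):
        return len(self.path) == DEPTH


def select(node, c=1.4):
    """Selection: follow UCB1 while the current node is fully expanded."""
    while not node.is_terminal() and not node.untried():
        node = max(
            node.children.values(),
            key=lambda ch: ch.value / ch.visits
            + c * math.sqrt(math.log(node.visits) / ch.visits),
        )
    return node


def expand(node, rng):
    """Expansion: add one untried child that extends the shared prefix."""
    if node.is_terminal():
        return node
    action = rng.choice(node.untried())
    child = Node(path=node.path + (action,), parent=node)
    node.children[action] = child
    return child


def simulate(node, rng):
    """Simulation: complete the chain with random steps and score it."""
    path = list(node.path)
    while len(path) < DEPTH:
        path.append(rng.choice(ACTIONS))
    return reward(path)


def backpropagate(node, r):
    """Backpropagation: credit the reward to every prefix on the path."""
    while node is not None:
        node.visits += 1
        node.value += r
        node = node.parent


def search(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        leaf = expand(select(root), rng)
        backpropagate(leaf, simulate(leaf, rng))
    # Read out the most-visited chain from the root.
    node, best = root, []
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        best.append(action)
    return tuple(best), root
```

Calling `search()` returns the most-visited chain and the root of the tree. Because every node stores a prefix, statistics gathered for one branch are reused by all branches that extend the same prefix, in contrast to independent sample-by-sample rollouts.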

Related papers