Fugu-MT paper translation (abstract): LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Paper overview: LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

arxiv url: http://arxiv.org/abs/2605.22012v1
Date: Thu, 21 May 2026 05:18:57 GMT
Status: Translation complete
System update date: 2026-05-22 16:35:42.104776
Title: LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Title (reference translation): LatentOmni: Rethinking Omni-Modal Understanding through Unified Audio-Visual Latent Reasoning
Authors: Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang
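Abstract summary: We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states.
Reference score (independently computed attention score): 31.98142661908727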
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
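Abstract (reference translation): Joint audio-visual reasoning is essential for omnimodal understanding, but current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, which weakens temporal grounding and shifts intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision that aligns latent reasoning states with task-relevant sensory features, and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation on multiple audio-visual reasoning benchmarks shows that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.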

Related papers