Fugu-MT paper translation (abstract): Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Paper summary: Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

arxiv url: http://arxiv.org/abs/2605.27741v1
Date: Tue, 26 May 2026 22:34:03 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.584285
Title: Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
Title (reference translation): Escaping the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
Authors: Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo
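Abstract summary: The paper introduces Modality-Aware Policy Optimization (MAPO), a dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions.
Reference score (attention score computed independently by this site): 40.86280811828235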
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.
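Abstract (reference translation): Audio and omni-modal large language models show impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens. This worsens late-stage modality collapse, in which models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. The paper therefore introduces Modality-Aware Policy Optimization (MAPO), a new dual-branch reinforcement learning framework. First, MAPO uses a modality relevance mask to dynamically concentrate the policy gradient on modality-critical tokens. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions, so that the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks show that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.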
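The two branches described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and not the authors' implementation: the abstract gives no formulas, so the per-token entropy difference, the threshold, the linear temporal scaling, and all function and parameter names (`modality_relevance_mask`, `masked_policy_loss`, `attention_penalty`, `threshold`, `floor`) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def modality_relevance_mask(policy_dists, ablated_dists, threshold=0.1):
    """Flag tokens whose entropy rises by more than `threshold` without audio.

    Assumption: the "cross-modal differential entropy" is read here as
    H(audio-ablated reference) - H(multimodal policy) per token position.
    The paper may define it differently.
    """
    if len(policy_dists) != len(ablated_dists):
        raise ValueError("distribution sequences must have equal length")
    return [
        entropy(abl) - entropy(pol) > threshold
        for pol, abl in zip(policy_dists, ablated_dists)
    ]


def masked_policy_loss(log_probs, advantage, mask):
    """REINFORCE-style loss averaged over masked tokens only.

    Assumption: a hard 0/1 mask with one sequence-level advantage; this is
    a simplification, not the GRPO or MAPO objective.
    """
    selected = [lp for lp, keep in zip(log_probs, mask) if keep]
    if not selected:
        return 0.0
    return -advantage * sum(selected) / len(selected)


def attention_penalty(audio_attention_mass, floor=0.2):
    """Penalize low attention mass on audio tokens, weighted by position.

    Assumption: "temporally scaled" is modeled as a linear ramp (t + 1) / n,
    so shortfalls late in the reasoning trace cost more than early ones.
    """
    n = len(audio_attention_mass)
    if n == 0:
        return 0.0
    total = 0.0
    for t, mass in enumerate(audio_attention_mass):
        scale = (t + 1) / n
        total += scale * max(0.0, floor - mass)
    return total / n
```

In this sketch a token counts as modality-critical when removing the audio makes the model noticeably less certain about it, and only those tokens contribute to the policy loss. The attention term is added separately, so the two branches can be weighted independently.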

List of related papers