Fugu-MT Paper Translation (Abstract): Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

Paper overview: Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

arxiv url: http://arxiv.org/abs/2510.20867v1
Date: Thu, 23 Oct 2025 06:18:10 GMT
Status: Translation complete
Last updated in system: 2025-10-28 09:00:15.275134
Title: Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards
Title (reference translation): Incentivizing Consistent, Effective, and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards
Authors: Jiajun Fan, Roger Ren, Jingyuan Li, Rahul Pandey, Prashanth Gurunath Shivakumar, Ivan Bulyko, Ankur Gandhe, Ge Liu, Yile Gu
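Abstract summary: The paper establishes a principled method for developing robust and scalable reasoning in Audio Large Language Models. It achieves state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and reaches near-human-level performance on MMSU reasoning tasks.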
Reference score (independently computed attention metric): 24.40159537923851
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The role of reasoning in Audio Large Language Models remains widely underexplored, as introducing a reasoning process often degrades rather than improves performance during inference, a phenomenon we term test-time inverse scaling, where longer reasoning chains yield progressively worse results. We demonstrate that this stems not from fundamental limitations of reasoning itself, but from inadequate training: models without proper guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. To address these challenges, we introduce CESAR (Consistent, Effective, and Scalable Audio Reasoners), shifting from outcome verification to rewarding the reasoning process. Our online reinforcement learning framework employs Group Relative Policy Optimization with a multi-faceted reward suite that incentivizes not only correctness and format but also consistency, structured analytical patterns, causal reasoning, domain-knowledge integration, and calibrated reasoning depth. CESAR resolves test-time inverse scaling, transforming reasoning from detriments into gains while revealing model-specific "reasoning sweet spots", where performance peaks during test-time scaling. We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks. Through AI-as-judge evaluations and qualitative comparisons, we provide both quantitative and qualitative validation of our improved reasoning quality. Importantly, enhanced reasoning creates synergistic effects, simultaneously improving multimodal reasoning and perception capabilities. Overall, CESAR establishes a principled method for developing robust and scalable reasoning in Audio LLMs.
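Abstract (reference translation): The role of reasoning in Audio Large Language Models remains underexplored, because introducing a reasoning process often degrades rather than improves performance during inference, with longer reasoning chains yielding progressively worse results (test-time inverse scaling). Models that lack proper guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. To address these challenges, we introduce CESAR (Consistent, Effective, and Scalable Audio Reasoners), shifting from outcome verification to rewarding the reasoning process. Our online reinforcement learning framework employs Group Relative Policy Optimization with a multi-faceted reward suite that incentivizes not only correctness and format but also consistency, structured analytical patterns, causal reasoning, domain-knowledge integration, and calibrated reasoning depth. CESAR resolves test-time inverse scaling, turning reasoning from a detriment into a gain, and reveals model-specific "reasoning sweet spots" where performance peaks during test-time scaling. We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks. Through AI-as-judge evaluations and qualitative comparisons, we provide both quantitative and qualitative validation of the improved reasoning quality. Importantly, enhanced reasoning creates synergistic effects, simultaneously improving multimodal reasoning and perception capabilities. Overall, CESAR establishes a principled method for developing robust and scalable reasoning in Audio LLMs.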

Related papers