Fugu-MT Paper Translation (Abstract): Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Paper overview: Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

arxiv url: http://arxiv.org/abs/2604.18187v1
Date: Mon, 20 Apr 2026 12:43:00 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.866383
Title: Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
Title (reference translation): Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain Generation in Audio Language Models
Authors: Xiang He, Chenxing Li, Jinting Wang, Yan Rong, Tianxin Xie, Wenfu Wang, Li Liu, Dong Yu
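Abstract summary: Existing methods for audio reasoning rely on supervised chain-of-thought fine-tuning or on reinforcement learning. This paper proposes Audio-DeepThinker, a framework built on two core ideas. Stage 1 trains on foundational audio QA to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%).
Reference score (independently computed attention score): 33.669071786618495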
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
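Abstract (reference translation): Large Audio-Language Models (LALMs) have made significant progress in audio understanding, but they function primarily as perception-and-answer systems. Existing methods for audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by the quality of the training data, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains appear well structured but lack specific acoustic grounding. This paper proposes Audio-DeepThinker, a framework built on two core ideas. First, it introduces a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator, which assesses logical path alignment, key step coverage, and analytical depth, with an embedding similarity component that enforces semantic alignment with reference reasoning chains. Second, it proposes a two-stage curriculum in which high-quality CoT reasoning emerges through pure RL exploration, without supervised reasoning fine-tuning, from an instruction-tuned model that has no prior chain-of-thought capability. Stage 1 trains on foundational audio QA to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for more diverse reasoning. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%) and won 1st place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses show that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, providing mechanistic insight into how audio reasoning emerges through exploration.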
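
The abstract describes a hybrid reward (an LLM evaluator score combined with embedding similarity to a reference reasoning chain) in Stage 1 and an LLM-only reward in Stage 2. The following is a minimal sketch of how such a stage-dependent reward might be wired together. The abstract does not say how the two components are combined, so the weighted sum, the `llm_weight` value, the [0, 1] score range, the clamping of negative similarity, and all function names are assumptions for illustration only, not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reasoning_reward(
    llm_score: float,
    generated_embedding: Sequence[float],
    reference_embedding: Sequence[float],
    stage: int,
    llm_weight: float = 0.5,  # assumed weighting; not given in the abstract
) -> float:
    """Hypothetical stage-dependent reasoning reward.

    llm_score is assumed to be an LLM evaluator's rating, already scaled
    to [0, 1]. The embeddings are assumed to come from some text encoder
    applied to the generated and reference reasoning chains.
    """
    if not 0.0 <= llm_score <= 1.0:
        raise ValueError("llm_score must be in [0, 1]")
    if stage == 2:
        # Stage 2: LLM-only reward, as the abstract describes.
        return llm_score
    if stage != 1:
        raise ValueError("stage must be 1 or 2")
    # Stage 1: hybrid reward. Negative similarity is clamped to 0 (an assumption)
    # so that the result stays in [0, 1].
    similarity = max(0.0, cosine_similarity(generated_embedding, reference_embedding))
    return llm_weight * llm_score + (1.0 - llm_weight) * similarity
```

Under these assumptions, `reasoning_reward(0.8, gen, ref, stage=1)` mixes the evaluator score with the embedding similarity, while `stage=2` returns the evaluator score unchanged, which removes the pull toward the reference chain's wording.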

