Fugu-MT Paper Translation (Abstract): Step-Audio-R1 Technical Report

Paper overview: Step-Audio-R1 Technical Report

arxiv url: http://arxiv.org/abs/2511.15848v1
Date: Wed, 19 Nov 2025 20:12:50 GMT
Status: Translation complete
Last updated in system: 2025-11-21 17:08:52.361595
Title: Step-Audio-R1 Technical Report
Title (reference translation): Step-Audio-R1 Technical Report
Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
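Abstract summary: This paper introduces Step-Audio-R1, the first audio reasoning model to successfully unlock reasoning capabilities in the audio domain. The model shows strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro.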
Reference score (independently calculated attention level): 70.37077572409319
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
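Abstract (reference translation): Recent advances in reasoning models have achieved remarkable success in the text and vision domains through extended chain-of-thought deliberation. However, a puzzling phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, which raises a fundamental question: can audio intelligence truly benefit from deliberate thinking? This paper introduces Step-Audio-R1, the first audio reasoning model to successfully unlock reasoning capabilities in the audio domain. Through the proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that are genuinely grounded in acoustic features rather than hallucinating disconnected deliberations. The model surpasses Gemini 2.5 Pro and achieves performance comparable to the state-of-the-art Gemini 3 Pro on comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results show that reasoning is a capability that transfers across modalities when appropriately anchored, turning extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward truly multimodal reasoning systems that think deeply across all sensory modalities.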

Related papers