Fugu-MT Paper Translation (Abstract): Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

Paper overview: Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

arxiv url: http://arxiv.org/abs/2508.15827v1
Date: Mon, 18 Aug 2025 15:14:04 GMT
Status: Translation complete
System update date: 2025-08-25 16:42:36.095963
Title: Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
Title (reference translation): Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
Authors: Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan
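Abstract summary: Mini-Omni-Reasoner is a framework that enables reasoning within speech through a novel "Thinking-in-Speaking" formulation. It interleaves silent reasoning tokens with spoken response tokens at the token level. On the Spoken-MQA benchmark, it achieves gains of +19.1% in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
Reference score (independently computed attention measure): 80.75260664100644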
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
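Abstract (reference translation): Reasoning is essential for effective communication and decision-making. Recent advances in LLMs and MLLMs show that incorporating explicit reasoning markedly improves understanding and generalization, but reasoning in LSMs is still at an early stage. Early attempts tried to carry the "Thinking-before-Speaking" paradigm over from text models to speech. This sequential formulation, however, introduces notable latency, because the spoken response is held back until reasoning has fully completed, which harms real-time interaction and communication efficiency. This paper therefore proposes Mini-Omni-Reasoner, a framework that enables reasoning within speech. Instead of completing its reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design permits continuous speech generation while embedding structured internal reasoning, and it exploits the model's high-frequency token processing capability. Although the tokens are interleaved, local semantic alignment is enforced so that each response token is informed by the reasoning that precedes it. To support the framework, the authors introduce Spoken-Math-Problems-3M, a large-scale dataset tailored to interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow the relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent, logically grounded spoken responses while maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves gains of +19.1% in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.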
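
Illustrative sketch (not from the paper): the token-level interleaving described in the abstract can be pictured with a few lines of Python. The function names, the "think"/"speak" tags, and the fixed 2:1 chunk ratio are all assumptions made for illustration; the abstract does not state the actual ratio, vocabulary, or training procedure. The sketch only shows the ordering property that each spoken token is preceded by reasoning tokens, and that the silent tokens can be filtered out before speech synthesis.

```python
from itertools import islice
from typing import Iterable, List, Tuple

# Hypothetical tags; the real token vocabulary is not given in the abstract.
THINK = "think"
SPEAK = "speak"

Tagged = Tuple[str, str]


def interleave(
    reasoning: Iterable[str],
    response: Iterable[str],
    think_per_chunk: int = 2,  # assumed ratio, for illustration only
    speak_per_chunk: int = 1,
) -> List[Tagged]:
    """Emit reasoning tokens ahead of response tokens in fixed-size chunks."""
    if think_per_chunk < 1 or speak_per_chunk < 1:
        raise ValueError("chunk sizes must be positive")
    r_it, s_it = iter(reasoning), iter(response)
    stream: List[Tagged] = []
    while True:
        think_chunk = list(islice(r_it, think_per_chunk))
        speak_chunk = list(islice(s_it, speak_per_chunk))
        if not think_chunk and not speak_chunk:
            break  # both sequences are exhausted
        # Reasoning goes first so each spoken token follows its reasoning.
        stream.extend((THINK, tok) for tok in think_chunk)
        stream.extend((SPEAK, tok) for tok in speak_chunk)
    return stream


def spoken_only(stream: Iterable[Tagged]) -> List[str]:
    """Drop the silent reasoning tokens, keeping what would be voiced."""
    return [tok for tag, tok in stream if tag == SPEAK]


if __name__ == "__main__":
    stream = interleave(
        ["3", "+", "4", "=", "7"],
        ["The", "answer", "is", "seven"],
    )
    print(stream)
    print(spoken_only(stream))
```

In this toy example the listener would hear only the "speak" tokens, while the "think" tokens stay internal; how the actual model aligns reasoning with speech is described in the paper itself.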
