Fugu-MT Paper Translation (Abstract): SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models

Paper overview: SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models

arxiv url: http://arxiv.org/abs/2603.09215v1
Date: Tue, 10 Mar 2026 05:39:03 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.060135
Title: SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models
Title (reference translation): SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models
Authors: Hsiao-Ying Huang, Cheng-Han Chiang, Hung-yi Lee
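Abstract summary: Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step is costly. SPAR-K is a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. The framework is evaluated using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks.
Reference score (independently computed attention score): 56.525932945429275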
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
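The alternating-depth schedule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `period` parameter, the choice to place the refresh on the first speech position of each period, and the example layer counts (32 layers, exit at layer 28) are all assumptions made for illustration. The abstract states only that most speech positions exit at a fixed intermediate layer and that full-depth refresh steps occur periodically.

```python
from typing import List


def alternating_depth_schedule(
    modalities: List[str], full_depth: int, exit_layer: int, period: int
) -> List[int]:
    """Return the number of layers to run at each decoding position.

    Hypothetical sketch: text positions always use full depth; speech
    positions exit at `exit_layer`, except every `period`-th speech
    position, which runs at full depth as a "refresh" step.
    """
    if not 1 <= exit_layer <= full_depth:
        raise ValueError("exit_layer must be in [1, full_depth]")
    if period < 1:
        raise ValueError("period must be at least 1")

    depths: List[int] = []
    speech_seen = 0  # counts speech positions only, not text positions
    for modality in modalities:
        if modality != "speech":
            depths.append(full_depth)
            continue
        # Assumption: the refresh falls on the first speech position of each period.
        is_refresh = speech_seen % period == 0
        depths.append(full_depth if is_refresh else exit_layer)
        speech_seen += 1
    return depths


def speech_depth_reduction(
    modalities: List[str], depths: List[int], full_depth: int
) -> float:
    """Fractional reduction in average decoding depth over speech positions."""
    speech_depths = [d for m, d in zip(modalities, depths) if m == "speech"]
    if not speech_depths:
        return 0.0
    return 1.0 - sum(speech_depths) / (len(speech_depths) * full_depth)


# Illustrative values only; these are not taken from the paper.
example_modalities = ["text", "text", "speech", "speech", "speech", "speech"]
example_depths = alternating_depth_schedule(
    example_modalities, full_depth=32, exit_layer=28, period=2
)
example_reduction = speech_depth_reduction(example_modalities, example_depths, 32)
```

Because the schedule depends only on position and modality, it needs no per-token confidence estimate, which is consistent with the abstract's claim of no auxiliary computation overhead. The actual exit layer and refresh period used for each model are not given in the abstract.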
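Abstract (reference translation): Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth at every step is costly, especially because speech sequences are long. SPAR-K is a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces an alternating-depth schedule for speech: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate the distribution shift caused by early exit. The framework is evaluated using Step-Audio-2-mini and GLM-4-Voice on four datasets spanning reasoning, factual QA, and dialogue tasks, in terms of ASR transcription accuracy and perceptual quality. Experimental results show that SPAR-K reduces average speech decoding depth by up to 11% on Step-Audio-2-mini and up to 5% on GLM-4-Voice, with negligible changes in MOS and WER and no auxiliary computation overhead, while limiting the drop in question-answering accuracy to at most 0.82%. The results further show that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens requires a specialized early exit design.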

