Fugu-MT Paper Translation (Abstract): HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling

Paper summary: HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling

arxiv url: http://arxiv.org/abs/2509.18570v1
Date: Tue, 23 Sep 2025 02:53:38 GMT
Status: Translation complete
Last updated in system: 2025-09-24 20:41:27.66518
Title: HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
Title (reference translation): HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
Authors: Yuke Si, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
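Abstract summary: HarmoniFuse is a component-selective and prompt-adaptive framework for multi-task speech language modeling. A batch-interleaved training strategy makes it possible to leverage separate ASR and SER datasets without requiring joint annotation.
Reference score (independently computed attention score): 52.537908557508324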
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large language models have facilitated the development of unified speech language models (SLMs) capable of supporting multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on distinct types of information: ASR primarily depends on linguistic content, whereas SER requires the integration of both linguistic and paralinguistic cues. Existing multitask SLMs typically adopt naive parameter sharing or prompt-based conditioning without explicitly modeling the differences in information composition required by each task. Such designs risk task interference and performance degradation, especially under limited data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. HarmoniFuse is designed to harmonize heterogeneous task demands by selecting and fusing task-relevant components of speech representations. Specifically, it integrates a gated speech encoder to extract task-specific acoustic features and a prompt-adaptive dynamic fusion module to aggregate transformer layers based on task characteristics. In addition, a batch-interleaved training strategy enables leveraging separate ASR and SER datasets without requiring joint annotation. Experimental results demonstrate that HarmoniFuse improves both ASR and SER performance, offering a scalable and robust solution for multitask speech understanding under realistic data constraints.
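Abstract (reference translation): Recent advances in large language models have facilitated the development of unified speech language models (SLMs) that can support multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on different types of information: ASR depends mainly on linguistic content, whereas SER requires integrating both linguistic and paralinguistic cues. Existing multi-task SLMs typically adopt naive parameter sharing or prompt-based conditioning without explicitly modeling the differences in the information composition each task requires. Such designs risk task interference and performance degradation, especially under limited-data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. HarmoniFuse is designed to harmonize heterogeneous task demands by selecting and fusing the task-relevant components of speech representations. Specifically, it integrates a gated speech encoder that extracts task-specific acoustic features and a prompt-adaptive dynamic fusion module that aggregates transformer layers based on task characteristics. In addition, a batch-interleaved training strategy makes it possible to leverage separate ASR and SER datasets without requiring joint annotation. Experiments show that HarmoniFuse improves both ASR and SER performance, offering a scalable and robust solution for multi-task speech understanding under realistic data constraints.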
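
The abstract names two mechanisms without giving their details: prompt-dependent aggregation of transformer layers, and batch-interleaved training over separate ASR and SER datasets. The sketch below is one possible reading of those ideas, not the authors' implementation. The function names, the softmax-weighted sum over layers, the linear prompt-to-score mapping, and the strict ASR/SER alternation are all assumptions made for illustration.

```python
import math
from itertools import zip_longest

_MISSING = object()  # sentinel for the shorter dataset running out of batches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_outputs, prompt_embedding, scoring_weights):
    """Combine per-layer feature vectors with prompt-dependent weights.

    layer_outputs:    one feature vector per transformer layer (assumed shape L x D)
    prompt_embedding: vector representing the task prompt (assumed length P)
    scoring_weights:  L x P matrix giving one score per layer (hypothetical parameter)
    """
    # One scalar score per layer, computed from the prompt alone.
    scores = [sum(w * p for w, p in zip(row, prompt_embedding))
              for row in scoring_weights]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    # Weighted sum across layers, computed separately for each feature dimension.
    fused = [sum(weights[i] * layer_outputs[i][d] for i in range(len(layer_outputs)))
             for d in range(dim)]
    return fused, weights


def interleave_batches(asr_batches, ser_batches):
    """Yield (task, batch) pairs, alternating between the two datasets.

    Each batch carries labels for its own task only, so no utterance needs
    both a transcript and an emotion label.
    """
    for asr_batch, ser_batch in zip_longest(asr_batches, ser_batches,
                                            fillvalue=_MISSING):
        if asr_batch is not _MISSING:
            yield "asr", asr_batch
        if ser_batch is not _MISSING:
            yield "ser", ser_batch


# Usage with toy values: two layers, two feature dimensions, all-zero scoring weights.
fused, weights = fuse_layers([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0],
                             [[0.0, 0.0], [0.0, 0.0]])
schedule = list(interleave_batches(["a1", "a2", "a3"], ["s1"]))
```

With all-zero scoring weights every layer receives the same weight, so `fused` is the plain average of the layer vectors. A training loop would consume `schedule` in order and apply the loss that matches each batch's task tag.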

List of related papers