Fugu-MT Paper Translation (Abstract): EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition

Paper summary: EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition

arxiv url: http://arxiv.org/abs/2508.14130v1
Date: Tue, 19 Aug 2025 06:58:16 GMT
Status: Translation complete
Last updated in system: 2025-08-21 16:52:41.213125
Title: EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Authors: Hugo Thimonier, Antony Perzo, Renaud Seguier
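Abstract summary: Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues. Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks beyond the natural language domain alone. This work proposes a novel approach that fine-tunes an LLM with audio and text representations for emotion prediction.
Reference score (independently computed attention score): 0.0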
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues, with critical applications in human-computer interaction and mental health monitoring. Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks beyond the natural language domain alone. In particular, recent approaches have investigated coupling LLMs with other data modalities by using pre-trained backbones and different fusion mechanisms. This work proposes a novel approach that fine-tunes an LLM with audio and text representations for emotion prediction. Our method first extracts audio features with an audio feature extractor, then maps them into the LLM's representation space via a learnable interfacing module. The LLM takes as input (1) the transformed audio features, (2) additional features in the form of natural language (e.g., the transcript), and (3) a textual prompt describing the emotion prediction task. To efficiently adapt the LLM to this multimodal task, we employ Low-Rank Adaptation (LoRA), enabling parameter-efficient fine-tuning. Experimental results on standard emotion recognition benchmarks demonstrate that our model outperforms all but one of the existing Speech-Text LLMs in the literature, while requiring less than half the parameters of competing approaches. This highlights our approach's effectiveness in integrating multimodal inputs for speech-based emotion understanding while maintaining significant computational efficiency.
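The pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the form of the interfacing module, so a single linear projection is assumed here, and the ordering of the three input segments, the dimensions, and all names (`LoRALinear`, `project_audio`, `build_llm_input`) are hypothetical. The LoRA layer follows the standard formulation, a frozen weight plus a scaled low-rank update whose second factor starts at zero, written in plain Python so that it runs without external libraries.

```python
import random


def matmul(a, b):
    """Multiply matrices stored as lists of rows: (n x k) @ (k x m) -> (n x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rand_matrix(rows, cols, rng, std=0.02):
    """Gaussian-initialised matrix as a list of rows."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * A @ B."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        self.weight = rand_matrix(d_in, d_out, rng)          # frozen during fine-tuning
        self.lora_a = rand_matrix(d_in, rank, rng)           # trainable
        self.lora_b = [[0.0] * d_out for _ in range(rank)]   # trainable, zero-initialised
        self.scaling = alpha / rank

    def forward(self, x):
        base = matmul(x, self.weight)
        update = matmul(matmul(x, self.lora_a), self.lora_b)
        return [[b + self.scaling * u for b, u in zip(rb, ru)]
                for rb, ru in zip(base, update)]

    def trainable_params(self):
        return (len(self.lora_a) * len(self.lora_a[0])
                + len(self.lora_b) * len(self.lora_b[0]))

    def frozen_params(self):
        return len(self.weight) * len(self.weight[0])


def project_audio(audio_features, projection):
    """Assumed interfacing module: a linear map from audio space to LLM space."""
    return matmul(audio_features, projection)


def build_llm_input(audio_emb, transcript_emb, prompt_emb):
    """Concatenate the three segments along the sequence axis (ordering is assumed)."""
    return audio_emb + transcript_emb + prompt_emb


# Toy dimensions chosen purely for illustration.
rng = random.Random(0)
d_audio, d_model, rank = 6, 8, 2
audio = rand_matrix(5, d_audio, rng, std=1.0)       # 5 audio frames
projection = rand_matrix(d_audio, d_model, rng)     # learnable interface weights
transcript = rand_matrix(4, d_model, rng, std=1.0)  # 4 transcript token embeddings
prompt = rand_matrix(3, d_model, rng, std=1.0)      # 3 prompt token embeddings

sequence = build_llm_input(project_audio(audio, projection), transcript, prompt)
layer = LoRALinear(d_model, d_model, rank, alpha=4.0, rng=rng)
hidden = layer.forward(sequence)
print(len(sequence), len(hidden[0]), layer.trainable_params(), layer.frozen_params())
```

Because the second low-rank factor starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small factors (plus the interface projection) would be updated during fine-tuning. In this toy layer that is 32 trainable values against 64 frozen ones; the gap widens as the model dimension grows relative to the rank.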
