Dual Information Speech Language Models for Emotional Conversations
- URL: http://arxiv.org/abs/2508.08095v1
- Date: Mon, 11 Aug 2025 15:33:44 GMT
- Title: Dual Information Speech Language Models for Emotional Conversations
- Authors: Chun Wang, Chenyang Liu, Wenze Xu, Weihong Deng,
- Abstract summary: Speech-language models (SLMs), which use speech as input, are emerging as a promising solution.<n>We identify entangled information and improper training strategies as key issues.<n>Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations.
- Score: 48.094826104102204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model's ability to effectively integrate both paralinguistic and linguistic information within contextual settings.
Related papers
- EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition [0.0]
Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues.<n>Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks outside of the sole natural language area.<n>This work proposes a novel approach that fine-tunes an LLM with audio and text representations for emotion prediction.
arXiv Detail & Related papers (2025-08-19T06:58:16Z) - Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models [19.864555505996112]
We propose two approaches to incorporate contextual paralinguistic information into model training.<n>Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach.
arXiv Detail & Related papers (2025-08-10T10:03:30Z) - ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody.<n>We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z) - Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
This work studies the capabilities of a large language model (LLM) to understand paralinguistic aspects of speech without fine-tuning its weights.<n>We utilize an end-to-end system with a speech encoder, which is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt.
arXiv Detail & Related papers (2024-10-02T01:32:47Z) - DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)<n>We present a simple yet effective automatic process for creating speech-text pair data.<n>Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.