Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
- URL: http://arxiv.org/abs/2512.23578v2
- Date: Sun, 04 Jan 2026 01:36:25 GMT
- Title: Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
- Authors: Yu-Xiang Lin, Cheng-Han Chiang, Hung-yi Lee
- Abstract summary: When spoken language models (SLMs) are instructed to speak in a specific speaking style, they cannot maintain the required style after several turns of interaction. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. Explicitly asking the model to recall the style instruction can partially mitigate style amnesia.
- Score: 61.494659340367605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.
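The abstract's comparison of instruction placement (system message vs. user message) can be illustrated with a minimal sketch. The message schema, instruction text, and helper names below are illustrative assumptions, not taken from the paper, and no specific SLM API is modeled:

```python
# Sketch of the two prompting strategies the paper compares: carrying a
# style instruction in the system message vs. in the first user turn.
# All names and strings here are hypothetical examples.

STYLE = "Please speak with a cheerful tone for the entire conversation."

def system_prompt_dialogue(user_turns):
    """Style instruction placed in the system message."""
    messages = [{"role": "system", "content": STYLE}]
    messages += [{"role": "user", "content": t} for t in user_turns]
    return messages

def user_prompt_dialogue(user_turns):
    """Style instruction prepended to the first user turn instead."""
    return [
        {"role": "user", "content": (STYLE + " " + t) if i == 0 else t}
        for i, t in enumerate(user_turns)
    ]

def with_style_reminder(messages):
    """Mitigation probe: ask the model to recall and keep the style."""
    reminder = {"role": "user",
                "content": "What speaking style were you asked to use? "
                           "Please continue using it."}
    return messages + [reminder]
```

Per the paper's finding, the first variant (system message) is the one SLMs struggle to follow, despite system prompts being intended for exactly this kind of persistent instruction; the reminder helper corresponds to the partial mitigation the authors report.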
Related papers
- F-Actor: Controllable Conversational Behaviour in Full-Duplex Models [70.48189107402145]
We present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. Our model requires just 2,000 hours of data, without relying on large-scale or multi-stage pretraining. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
arXiv Detail & Related papers (2026-01-16T14:25:57Z) - VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions [66.93932684284695]
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style. We present VStyle, a benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness.
arXiv Detail & Related papers (2025-09-09T14:28:58Z) - Dual Information Speech Language Models for Emotional Conversations [48.094826104102204]
Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. We identify entangled information and improper training strategies as key issues. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations.
arXiv Detail & Related papers (2025-08-11T15:33:44Z) - Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations [65.29513437838457]
Even if two turns contain the same sentence, the appropriate responses may differ when the sentence is spoken in different styles.
We propose the Spoken-LLM framework, which can model both the linguistic content and the speaking styles.
We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles.
arXiv Detail & Related papers (2024-02-20T07:51:43Z) - Conversation Style Transfer using Few-Shot Learning [56.43383396058639]
In this paper, we introduce conversation style transfer as a few-shot learning problem.
We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot.
We show that conversation style transfer can also benefit downstream tasks.
arXiv Detail & Related papers (2023-02-16T15:27:00Z) - Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis [17.650661515807993]
We propose to inject style into the talking face synthesis framework through imitating arbitrary talking style of the particular reference video.
We devise a latent-style-fusion(LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
arXiv Detail & Related papers (2021-10-30T08:15:27Z) - Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach [46.50460811211031]
A key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
arXiv Detail & Related papers (2020-07-24T15:01:02Z) - Learning to mirror speaking styles incrementally [0.0]
Mirroring is the behavior in which one person subconsciously imitates the gesture, speech pattern, or attitude of another.
In this work, we explore a method that can learn to mirror the speaking styles of a person incrementally.
Our method extracts n-grams that capture a person's speaking style and uses them to create patterns for transforming sentences into that person's speaking style.
arXiv Detail & Related papers (2020-03-05T02:54:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.