VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
- URL: http://arxiv.org/abs/2509.09716v2
- Date: Mon, 22 Sep 2025 02:40:04 GMT
- Title: VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
- Authors: Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
- Abstract summary: Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style. We present VStyle, a benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness.
- Score: 66.93932684284695
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project's homepage: https://junzhan2000.github.io/VStyle.github.io/
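As a rough illustration of the progressive evaluation the abstract describes, the sketch below wires the three named stages (textual faithfulness, style adherence, naturalness) into a gated pipeline. The `query_lalm` function, the rubric prompts, the [0, 1] score scale, and the 0.5 threshold are all hypothetical stand-ins, not the released toolkit's API.

```python
# Minimal sketch of a progressive "LALM as a Judge" pipeline, following the
# three stages named in the abstract. `query_lalm` is a hypothetical wrapper
# around whatever large audio language model serves as the judge; prompts,
# score scale, and gating threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class JudgeResult:
    stage: str      # stage at which evaluation stopped
    score: float    # score assigned at that stage

STAGES = [
    ("textual_faithfulness", "Does the speech say what the instruction asked for?"),
    ("style_adherence",      "Does the speaking style match the spoken instruction?"),
    ("naturalness",          "Does the speech sound natural and fluent?"),
]

def query_lalm(audio_path: str, instruction: str, rubric: str) -> float:
    """Hypothetical call to an audio LLM judge; returns a score in [0, 1]."""
    raise NotImplementedError("wire up your LALM backend here")

def judge(audio_path: str, instruction: str, threshold: float = 0.5) -> JudgeResult:
    # Evaluate stages in order; a failing stage short-circuits the pipeline,
    # so style is only judged on faithful outputs, naturalness on stylish ones.
    score = 0.0
    for stage, rubric in STAGES:
        score = query_lalm(audio_path, instruction, rubric)
        if score < threshold:
            return JudgeResult(stage=stage, score=score)
    return JudgeResult(stage="naturalness", score=score)
```

Gating the stages this way means style is only scored on outputs that already say the right thing, which keeps the three scores interpretable.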
Related papers
- ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech [15.969757677847504]
ParaMETA is a framework for learning and controlling speaking styles directly from speech. It learns disentangled, task-specific embeddings by projecting speech into dedicated subspaces for each type of style. It supports both speech- and text-based prompting and allows users to modify one speaking style while preserving others.
arXiv Detail & Related papers (2026-01-18T07:05:40Z)
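The subspace idea in the ParaMETA abstract can be pictured with a small sketch: a shared speech embedding is projected through one head per style factor, and editing a single factor leaves the others untouched. The factor names, dimensions, and layers below are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch (not ParaMETA's released code): project a shared speech
# representation into dedicated subspaces, one per style factor, so each
# factor can be read out or swapped on its own.
import torch
import torch.nn as nn

class StyleSubspaces(nn.Module):
    def __init__(self, speech_dim: int = 512, sub_dim: int = 64,
                 factors=("emotion", "timbre", "prosody")):
        super().__init__()
        # One linear projection head per style factor.
        self.heads = nn.ModuleDict({f: nn.Linear(speech_dim, sub_dim) for f in factors})

    def forward(self, speech_emb: torch.Tensor) -> dict:
        # Returns {factor: (batch, sub_dim)} task-specific embeddings.
        return {f: head(speech_emb) for f, head in self.heads.items()}

model = StyleSubspaces()
src, ref = torch.randn(1, 512), torch.randn(1, 512)
styles = model(src)
# Modify one speaking style while preserving the others: take "emotion"
# from a reference utterance, keep the rest from the source.
styles["emotion"] = model(ref)["emotion"]
```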
- BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs [84.59993864748195]
We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions and generating a textual plan of explicit vocal features. A separate TTS model, the "orchestra", then generates the speech from these features.
arXiv Detail & Related papers (2025-09-30T16:52:14Z)
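A minimal sketch of BatonVoice's conductor/orchestra split, as described in its abstract: the LLM emits only an explicit plan of vocal features, and a separate TTS model renders speech from that plan without ever seeing the instruction. `run_llm`, `run_tts`, and the JSON feature schema are hypothetical placeholders, not the paper's API.

```python
# Sketch of the conductor/orchestra decoupling from the BatonVoice abstract.
# Both backends are hypothetical stand-ins; the feature schema is assumed.
import json

PLAN_PROMPT = (
    "Convert this speaking instruction into JSON with keys "
    '"pitch", "speed", "volume", "emotion": {instruction}'
)

def run_llm(prompt: str) -> str:
    raise NotImplementedError("call your instruction-following LLM here")

def run_tts(text: str, vocal_features: dict) -> bytes:
    raise NotImplementedError("call a controllable TTS model here")

def baton_synthesize(text: str, instruction: str) -> bytes:
    # Conductor: the LLM does the understanding, emitting only features.
    plan = json.loads(run_llm(PLAN_PROMPT.format(instruction=instruction)))
    # Orchestra: the TTS model does the rendering, never seeing the instruction.
    return run_tts(text, vocal_features=plan)
```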
- Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations [65.29513437838457]
Even if two current turns contain the same sentence, the appropriate responses may still differ when the turns are spoken in different styles.
We propose the Spoken-LLM framework, which models both linguistic content and speaking styles.
We train Spoken-LLM on the StyleTalk dataset and devise a two-stage training pipeline to help it better learn speaking styles.
arXiv Detail & Related papers (2024-02-20T07:51:43Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
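The labeling idea in the entry above can be sketched as a rule that turns automatically measured attributes into a natural-language description suitable for conditioning a speech LM. The attribute set, binning thresholds, and wording below are illustrative assumptions, not the paper's recipe.

```python
# Sketch of the synthetic-annotation idea: measure simple attributes of each
# utterance automatically, render them as a natural-language description, and
# condition a speech LM on (description, text) pairs. Bins are assumptions.
def describe(pitch_hz: float, rate_wps: float, snr_db: float, gender: str) -> str:
    pitch = "high-pitched" if pitch_hz > 180 else "low-pitched"
    speed = "quickly" if rate_wps > 3.0 else "slowly"
    channel = "in a clean recording" if snr_db > 25 else "over a noisy channel"
    return f"A {gender} speaker with a {pitch} voice talks {speed} {channel}."

print(describe(pitch_hz=210.0, rate_wps=3.4, snr_db=30.0, gender="female"))
# -> "A female speaker with a high-pitched voice talks quickly in a clean recording."
```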
- StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models [17.945821635380614]
StyleCap is a method for generating natural language descriptions of the speaking styles that appear in speech. It is trained on paired data of speech and natural language descriptions.
arXiv Detail & Related papers (2023-11-28T04:49:17Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose ABINet++, an autonomous, bidirectional, and iterative model for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two dedicated components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
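GenerSpeech's decomposition, per the abstract above, can be sketched as two encoders whose outputs are recombined: one for the style-agnostic (content) part and one for the style-specific part taken from a reference utterance. The layers and dimensions below are illustrative assumptions, not the published model.

```python
# Illustrative sketch (not GenerSpeech's code) of decomposing speech variation
# into a style-agnostic part and a style-specific part, recombined for synthesis.
import torch
import torch.nn as nn

class DecomposedEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.content = nn.Linear(80, dim)   # style-agnostic: linguistic content
        self.style = nn.Linear(80, dim)     # style-specific: custom voice traits

    def forward(self, mel_src: torch.Tensor, mel_ref: torch.Tensor) -> torch.Tensor:
        # Zero-shot style transfer: content from the source utterance,
        # style pooled from an out-of-domain reference utterance.
        return self.content(mel_src) + self.style(mel_ref.mean(dim=1, keepdim=True))

enc = DecomposedEncoder()
h = enc(torch.randn(1, 120, 80), torch.randn(1, 90, 80))  # -> (1, 120, 256)
```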
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
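A minimal sketch of an LSTM language model over sub-word linguistic units, in the spirit of the entry above; the phoneme vocabulary size and layer widths are illustrative assumptions.

```python
# Minimal sketch of a generative LM over sub-word units (here phonemes);
# vocabulary and sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class PhonemeLM(nn.Module):
    def __init__(self, n_units: int = 45, emb: int = 64, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_units)  # next-unit prediction

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(units))
        return self.out(h)  # logits over the next phoneme at each step

lm = PhonemeLM()
logits = lm(torch.randint(0, 45, (1, 20)))  # -> (1, 20, 45)
```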
- Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach [46.50460811211031]
A key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
arXiv Detail & Related papers (2020-07-24T15:01:02Z)
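The style-switching idea in the Mix-StAGE entry can be sketched as a single generator conditioned on a per-speaker style embedding; swapping the embedding restyles gestures for the same input speech. The architecture and dimensions below are illustrative assumptions, not the Mix-StAGE model.

```python
# Sketch of style switching via per-speaker embeddings: one generator shared
# across speakers; swapping the embedding restyles the same input speech.
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, n_speakers: int = 8, audio_dim: int = 128,
                 style_dim: int = 32, pose_dim: int = 42):
        super().__init__()
        self.styles = nn.Embedding(n_speakers, style_dim)  # one embedding per speaker
        self.decode = nn.Linear(audio_dim + style_dim, pose_dim)

    def forward(self, speech_feats: torch.Tensor, speaker_id: int) -> torch.Tensor:
        style = self.styles(torch.tensor([speaker_id]))    # (1, style_dim)
        style = style.expand(speech_feats.size(0), -1)     # broadcast over time
        return self.decode(torch.cat([speech_feats, style], dim=-1))

gen = GestureGenerator()
speech = torch.randn(100, 128)          # same input speech
poses_a = gen(speech, speaker_id=0)     # speaker A's gesturing style
poses_b = gen(speech, speaker_id=3)     # restyled by switching the embedding
```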
This list is automatically generated from the titles and abstracts of the papers on this site.