StyleStream: Real-Time Zero-Shot Voice Style Conversion
- URL: http://arxiv.org/abs/2602.20113v1
- Date: Mon, 23 Feb 2026 18:32:59 GMT
- Title: StyleStream: Real-Time Zero-Shot Voice Style Conversion
- Authors: Yisi Liu, Nicholas Lee, Gopala Anumanchipalli
- Abstract summary: StyleStream is a zero-shot voice style conversion system that achieves state-of-the-art performance. The design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second.
- Score: 14.496282800974141
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
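As a rough illustration of the architecture described in the abstract, the sketch below shows a two-stage, chunk-wise pipeline: a Destylizer squeezes input features through a narrow bottleneck to strip style, and a Stylizer stands in for the diffusion transformer that reapplies a reference-style embedding. All module names, dimensions, and the streaming loop are assumptions for illustration, not the released StyleStream code.

```python
# Minimal sketch of a Destylizer -> Stylizer pipeline (assumed interface).
import torch
import torch.nn as nn

class Destylizer(nn.Module):
    """Maps input speech features to a style-free content representation
    through a narrow information bottleneck (bottleneck size assumed)."""
    def __init__(self, feat_dim=80, bottleneck_dim=8):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, bottleneck_dim))

    def forward(self, mel):                      # (B, T, feat_dim)
        return self.encode(mel)                  # (B, T, bottleneck_dim)

class Stylizer(nn.Module):
    """Stand-in for the diffusion transformer (DiT): conditions content
    frames on a reference-style embedding and regenerates features."""
    def __init__(self, bottleneck_dim=8, style_dim=128, feat_dim=80):
        super().__init__()
        self.proj = nn.Linear(bottleneck_dim + style_dim, feat_dim)

    def forward(self, content, style):           # (B, T, C), (B, style_dim)
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.proj(torch.cat([content, style], dim=-1))

# Non-autoregressive, chunk-wise inference: each incoming chunk is converted
# independently of future audio, which is what makes ~1 s latency plausible.
destylizer, stylizer = Destylizer(), Stylizer()
ref_style = torch.randn(1, 128)                  # embedding of the reference speech (assumed)
for chunk in torch.randn(5, 1, 50, 80):          # five 50-frame mel chunks (dummy data)
    converted = stylizer(destylizer(chunk), ref_style)
    print(converted.shape)                       # torch.Size([1, 50, 80])
```

Because no step depends on previously generated output, every chunk can be processed as soon as it arrives, which is the property the abstract attributes to the fully non-autoregressive design.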
Related papers
- VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions [66.93932684284695]
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation.
We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style.
We present VStyle, a benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy.
We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness.
arXiv Detail & Related papers (2025-09-09T14:28:58Z)
- Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech.
In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder.
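A heavily simplified sketch of a conditional variational autoencoder for this kind of disentanglement is given below; the frame-level latent, dimensions, and toy objective are assumptions for illustration rather than the paper's actual framework.

```python
# Minimal conditional VAE sketch: infer a latent style code from the input
# features and decode conditioned on content plus that latent (all assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, feat_dim=80, content_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * latent_dim)          # -> (mu, logvar)
        self.dec = nn.Linear(content_dim + latent_dim, feat_dim)

    def forward(self, mel, content):                            # (B, T, feat), (B, T, content_dim)
        mu, logvar = self.enc(mel).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        recon = self.dec(torch.cat([content, z], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return F.mse_loss(recon, mel) + kl                       # toy training objective

loss = ConditionalVAE()(torch.randn(2, 100, 80), torch.randn(2, 100, 256))
print(loss.item())
```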
arXiv Detail & Related papers (2025-06-04T14:42:12Z)
- Style Mixture of Experts for Expressive Text-To-Speech Synthesis [7.6732312922460055]
StyleMoE is an approach that addresses the issue of learning averaged style representations in the style encoder.
The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts layer.
Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech.
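A minimal sketch of the mixture-of-experts idea behind StyleMoE follows: a gating network mixes several expert style encoders per reference utterance, so no single encoder has to average over all styles. The expert count, dimensions, and mean-pooled gating input are assumptions, not the paper's implementation.

```python
# Mixture-of-experts style encoder sketch (assumed sizes and gating).
import torch
import torch.nn as nn

class StyleMoE(nn.Module):
    def __init__(self, feat_dim=80, style_dim=128, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, style_dim) for _ in range(num_experts)])
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, ref_mel):                                   # (B, T, feat_dim)
        pooled = ref_mel.mean(dim=1)                               # utterance-level summary
        weights = torch.softmax(self.gate(pooled), dim=-1)         # (B, num_experts)
        expert_out = torch.stack([e(pooled) for e in self.experts], dim=1)  # (B, E, style_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # (B, style_dim)

style = StyleMoE()(torch.randn(2, 120, 80))
print(style.shape)                                                 # torch.Size([2, 128])
```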
arXiv Detail & Related papers (2024-06-05T22:17:47Z)
- Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data [2.6217304977339473]
We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experiment results show that our proposed method combined with a diffusion-based generative model can achieve better speaker similarity in any-to-any voice conversion tasks.
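The attention-based style transfer can be sketched roughly as a cross-attention lookup: source content frames (e.g. SSL features) query the target speaker's frames, so each source frame retrieves the style most relevant to its content. The feature size and the use of a single MultiheadAttention layer are illustrative assumptions, not the paper's model.

```python
# Content-dependent style lookup via cross-attention (assumed setup).
import torch
import torch.nn as nn

ssl_dim = 768                                 # e.g. HuBERT/WavLM feature size (assumed)
attn = nn.MultiheadAttention(embed_dim=ssl_dim, num_heads=8, batch_first=True)

src_content = torch.randn(1, 200, ssl_dim)    # SSL features of the source utterance
tgt_frames  = torch.randn(1, 350, ssl_dim)    # SSL features of the target speaker

# Query = source content, Key/Value = target frames: the output is a per-frame
# style conditioning sequence aligned with the source content.
style_seq, _ = attn(query=src_content, key=tgt_frames, value=tgt_frames)
print(style_seq.shape)                        # torch.Size([1, 200, 768])
```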
arXiv Detail & Related papers (2023-09-06T05:33:54Z)
- DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization [66.42741426640633]
DiffStyler is a dual diffusion processing architecture to control the balance between the content and style of diffused results.
We propose a content image-based learnable noise on which the reverse denoising process is based, enabling the stylization results to better preserve the structure information of the content image.
arXiv Detail & Related papers (2022-11-19T12:30:44Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer to out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Fine-grained style control in Transformer-based Text-to-speech Synthesis [78.92428622630861]
We present a novel architecture that realizes fine-grained style control in Transformer-based text-to-speech synthesis (TransformerTTS).
We model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech.
Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability.
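A rough sketch of local style tokens (LST): instead of a single global style vector, each reference frame attends over a small learned token bank, producing a time sequence of style vectors. Token count and dimensions are assumptions for illustration.

```python
# Local style token extraction sketch (assumed sizes).
import torch
import torch.nn as nn

class LocalStyleTokens(nn.Module):
    def __init__(self, feat_dim=80, token_dim=256, num_tokens=10):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(feat_dim, token_dim)

    def forward(self, ref_mel):                                    # (B, T, feat_dim)
        q = self.query_proj(ref_mel)                               # (B, T, token_dim)
        scores = torch.softmax(q @ self.tokens.T / q.size(-1) ** 0.5, dim=-1)  # (B, T, num_tokens)
        return scores @ self.tokens                                # (B, T, token_dim): one style vector per frame

lst = LocalStyleTokens()(torch.randn(1, 300, 80))
print(lst.shape)                                                   # torch.Size([1, 300, 256])
```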
arXiv Detail & Related papers (2021-10-12T19:50:02Z)
- Global Rhythm Style Transfer Without Text Transcriptions [98.09972075975976]
Prosody plays an important role in characterizing the style of a speaker or an emotion.
Most non-parallel voice or emotion style transfer algorithms do not convert any prosody information.
We propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
arXiv Detail & Related papers (2021-06-16T02:21:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.