Voice Impression Control in Zero-Shot TTS
- URL: http://arxiv.org/abs/2506.05688v2
- Date: Mon, 09 Jun 2025 23:14:18 GMT
- Title: Voice Impression Control in Zero-Shot TTS
- Authors: Kenichi Fujita, Shota Horiguchi, Yusuke Ijima
- Abstract summary: We develop a voice impression control method in zero-shot text-to-speech. We use a low-dimensional vector to represent the intensities of various voice impression pairs. The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control.
- Score: 15.46515385197271
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).
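As a rough illustration of the pipeline the abstract describes, the sketch below maps a free-form impression description to a low-dimensional intensity vector over antonymic impression pairs via an LLM. The pair list, the [-1, 1] scale, and the llm_complete callable are illustrative assumptions, not the paper's actual interface.

```python
import json

# Illustrative antonymic impression pairs; the paper's actual set may differ.
IMPRESSION_PAIRS = ["dark-bright", "calm-lively", "weak-powerful", "cold-warm"]

PROMPT = (
    "Rate the target voice on each axis from -1.0 (first adjective) to "
    "+1.0 (second adjective). Answer with a JSON object mapping axis "
    "name to a float.\nAxes: {axes}\nDescription: {description}\nJSON:"
)

def impression_vector(description: str, llm_complete) -> list:
    """Ask any text-completion callable for per-pair intensities."""
    reply = llm_complete(PROMPT.format(axes=", ".join(IMPRESSION_PAIRS),
                                       description=description))
    scores = json.loads(reply)
    # Clamp so a noisy LLM answer cannot push the conditioning out of range.
    return [max(-1.0, min(1.0, float(scores.get(p, 0.0))))
            for p in IMPRESSION_PAIRS]

# Usage: vec = impression_vector("a bright, energetic voice", my_llm)
# The vector would then condition the zero-shot TTS model alongside the
# speaker prompt, e.g. tts.synthesize(text, speaker_prompt, impression=vec).
```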
Related papers
- Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion [16.19865417052239]
Discl-VC is a novel zero-shot voice conversion framework.
It disentangles content and prosody information from self-supervised speech representations.
It synthesizes the target speaker's voice through in-context learning.
arXiv Detail & Related papers (2025-05-30T07:04:23Z)
- Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement [18.669597283434662]
Vevo is a versatile zero-shot voice imitation framework with controllable timbre and style.
To obtain the content and content-style tokens of speech, we design a fully self-supervised approach.
Vevo matches or surpasses existing methods in accent and emotion conversion tasks.
arXiv Detail & Related papers (2025-02-11T04:18:33Z)
- ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control [50.27383290553548]
ControlSpeech is a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style.
We show that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy and design a range-melody decoupled pitch representation (sketched after this entry).
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
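A minimal sketch of one plausible reading of Prompt-Singer's range-melody decoupled pitch representation: the singer's register becomes a coarse range token, and the melody is expressed in semitones relative to that register. The log-scale bucketing below is an assumption for illustration, not the paper's exact scheme.

```python
import numpy as np

def decouple_pitch(f0_hz: np.ndarray, n_range_buckets: int = 8):
    """Split F0 into a singer-level range token and a relative melody contour."""
    voiced = f0_hz[f0_hz > 0]                  # ignore unvoiced frames
    center = np.median(voiced)                 # singer's register, in Hz
    # Coarse range token: bucket the register on a log scale (80-800 Hz assumed).
    log_pos = (np.log2(center) - np.log2(80.0)) / (np.log2(800.0) - np.log2(80.0))
    range_token = int(np.clip(log_pos * n_range_buckets, 0, n_range_buckets - 1))
    # Melody: F0 in semitones relative to the register, invariant to singer range.
    melody = np.where(f0_hz > 0,
                      12.0 * np.log2(np.maximum(f0_hz, 1e-3) / center),
                      0.0)
    return range_token, melody

# Usage: token, melody = decouple_pitch(f0)   # f0 from any pitch tracker
# Prompting a different vocal range then only needs to swap the range token,
# leaving the melody contour untouched.
```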
- SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer [0.0]
Expressive speech synthesis models are trained by adding corpora with diverse speakers, various emotions, and different speaking styles to the dataset.
In this paper, we propose a style-control (SC) VALL-E model based on the neural codec language model VALL-E.
The proposed SC VALL-E takes text sentences and prompt audio as input and is designed to generate controllable speech.
arXiv Detail & Related papers (2023-07-20T03:28:06Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of an SSL model, trained on a large amount of data, to obtain embedding vectors from speech representations (a stand-in sketch follows this entry).
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
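A hedged sketch of the general idea in the entry above: condition synthesis on embeddings taken directly from an SSL speech model. The paper's actual SSL model and pooling are not specified here; wav2vec 2.0 via torchaudio is used purely as a stand-in.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE   # stand-in SSL model
ssl_model = bundle.get_model().eval()

def utterance_embedding(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Mean-pool the last SSL layer into one conditioning vector.

    waveform: (1, num_samples) mono audio.
    """
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, bundle.sample_rate)
    with torch.inference_mode():
        layers, _ = ssl_model.extract_features(waveform)  # one tensor per layer
    return layers[-1].mean(dim=1).squeeze(0)              # (hidden_dim,)

# The TTS decoder would consume this embedding in place of a conventional
# speaker-embedding lookup, which is what enables unseen-speaker synthesis.
```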
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (the quantizer idea is sketched after this entry).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
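Residual vector quantization, which the NaturalSpeech 2 summary refers to, is easy to state in a few lines: each stage quantizes the residual left by the previous stage, so reconstruction error shrinks as stages are added. The random codebooks below are stand-ins for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, codebook_size, dim = 4, 256, 64
codebooks = rng.normal(size=(n_stages, codebook_size, dim))  # stand-in codebooks

def rvq_encode(x: np.ndarray):
    """Quantize vector x into one codeword index per stage."""
    residual, indices = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest codeword
        indices.append(idx)
        residual = residual - cb[idx]      # next stage models what is left over
    return indices

def rvq_decode(indices) -> np.ndarray:
    """Reconstruct by summing the chosen codewords across stages."""
    return sum(codebooks[s][i] for s, i in enumerate(indices))

# Usage: idx = rvq_encode(latent); approx = rvq_decode(idx)
# Stacking several small codebooks this way gives codecs fine reconstruction
# at a fraction of the cost of one enormous codebook.
```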
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset (a minimal clustering sketch follows this entry).
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
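A minimal sketch of unsupervised prosodic clustering as the entry above describes: z-score per-phoneme (F0, duration) features, cluster them with k-means, and reorder cluster ids by mean F0 so the discrete labels read as intuitive low-to-high controls. The normalization and label-ordering choices are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def prosody_labels(f0_means: np.ndarray, durations: np.ndarray, n_clusters: int = 5):
    """Cluster per-phoneme prosody into discrete control labels."""
    feats = np.stack([f0_means, durations], axis=1)
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)  # z-score
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    # Order cluster ids by mean F0 so label 0 = lowest, n-1 = highest: this
    # ordering is what makes the discrete labels intuitive control inputs.
    order = np.argsort([f0_means[km.labels_ == c].mean() for c in range(n_clusters)])
    remap = {int(old): new for new, old in enumerate(order)}
    return np.array([remap[int(l)] for l in km.labels_]), km

# At synthesis time, swapping a phoneme's label shifts its F0/duration toward
# that cluster's centroid, giving discrete phoneme-level prosody control.
```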
- Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
How to control accent intensity during TTS is an interesting research direction.
Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, then adjusts the loss weight to control accent intensity (the adversarial mechanism is sketched after this entry).
This paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS.
arXiv Detail & Related papers (2022-10-27T12:23:41Z)
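The speaker-adversarial loss mentioned in the entry above is commonly implemented with a gradient reversal layer (GRL): the speaker classifier trains normally while the encoder receives a reversed gradient that strips speaker cues from the accent encoding, and the loss weight acts as the intensity knob. Whether the cited work uses exactly this construction is an assumption.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)               # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # reverse (and scale) the gradient

def speaker_adversarial_loss(accent_emb, speaker_ids, clf: nn.Module, lam=1.0):
    """Classifier trains normally; the encoder sees a reversed gradient."""
    logits = clf(GradReverse.apply(accent_emb, lam))
    return nn.functional.cross_entropy(logits, speaker_ids)

# Usage: total = tts_loss + w * speaker_adversarial_loss(emb, spk, clf)
# where the weight w is the knob such work adjusts to vary accent intensity.
```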
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder (a decoder sketch follows this entry).
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
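A minimal sketch of the conversion step the entry above describes: broadcast an arbitrary speaker embedding over time, concatenate it with per-frame content embeddings, and decode to mel frames. The layer sizes and the GRU decoder are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SeqDecoder(nn.Module):
    def __init__(self, content_dim=64, speaker_dim=128, mel_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(content_dim + speaker_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, content, speaker):
        # content: (B, T, content_dim); speaker: (B, speaker_dim)
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)  # tile over time
        out, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.proj(out)                                       # (B, T, mel_dim)

# Zero-shot conversion: content comes from the source utterance's encoder,
# speaker from any (possibly unseen) target speaker, so swapping the speaker
# vector alone changes the voice.
```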
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a VQ sketch follows this entry).
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
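A sketch of the content-encoding building block named in the VQMIVC entry: vector quantization with commitment/codebook losses and a straight-through gradient. The MI estimation and minimization that VQMIVC adds for disentanglement is omitted here and only noted in the closing comment.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """VQ layer with straight-through gradients (MI term omitted)."""
    def __init__(self, n_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                                    # z: (B, T, dim)
        # Squared distance from every frame to every codeword: (B, T, n_codes).
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(dim=-1)                               # nearest-code indices
        q = self.codebook(idx)                               # quantized frames
        # Codebook loss pulls codewords to the encoder output; the commitment
        # term (scaled by beta) keeps the encoder near its chosen codewords.
        loss = ((q - z.detach()) ** 2).mean() + self.beta * ((q.detach() - z) ** 2).mean()
        q = z + (q - z).detach()    # straight-through: gradients reach the encoder
        return q, idx, loss

# VQMIVC additionally estimates the mutual information between these content
# codes and the speaker/pitch representations and minimizes it during training,
# which is the disentanglement mechanism the summary highlights.
```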