PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models
- URL: http://arxiv.org/abs/2312.08494v1
- Date: Wed, 13 Dec 2023 20:14:27 GMT
- Title: PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models
- Authors: Robin Netzorg, Ajil Jalal, Luna McNulty, Gopala Krishna Anumanchipalli
- Abstract summary: PerMod is a conditional latent diffusion model that takes in an input voice and a perceptual qualities vector.
Unlike prior work, PerMod generates a new voice corresponding to specific perceptual modifications.
We demonstrate that PerMod produces voices with the desired perceptual qualities for typical voices, but performs poorly on atypical voices.
- Score: 5.588733538696248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perceptual modification of voice is an elusive goal. While non-experts can
modify an image or sentence perceptually with available tools, it is not clear
how to similarly modify speech along perceptual axes. Voice conversion does
make it possible to convert one voice to another, but these modifications are
handled by black box models, and the specifics of what perceptual qualities to
modify and how to modify them are unclear. Towards allowing greater perceptual
control over voice, we introduce PerMod, a conditional latent diffusion model
that takes in an input voice and a perceptual qualities vector, and produces a
voice with the matching perceptual qualities. Unlike prior work, PerMod
generates a new voice corresponding to specific perceptual modifications.
Evaluating perceptual quality vectors with RMSE from both human and predicted
labels, we demonstrate that PerMod produces voices with the desired perceptual
qualities for typical voices, but performs poorly on atypical voices.
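The abstract describes the model only at the interface level: a conditional latent diffusion model takes an input voice and a perceptual-quality vector and produces a voice with the matching perceptual qualities, and results are evaluated by RMSE between target quality vectors and human or predicted labels. The following is a minimal, hypothetical sketch of that interface and metric, not the authors' implementation; the generic DDPM-style sampler, the placeholder denoiser, and all names and shapes are illustrative assumptions.

```python
"""Hedged sketch (not the released PerMod code): illustrates the interface in the
abstract -- (input-voice latent, perceptual-quality vector) -> modified-voice latent --
plus the RMSE metric comparing target quality vectors against human or predicted labels."""
import numpy as np


def denoiser(z_t, t, source_latent, quality_vec):
    """Placeholder for a learned conditional denoising network.
    A real model would predict the noise in z_t given the timestep t, the
    source-voice latent, and the target perceptual-quality vector.
    (quality_vec would condition a real network; it is unused in this stand-in.)"""
    # Illustrative stand-in: pull the noisy latent toward the source latent.
    return z_t - source_latent


def modify_voice(source_latent, quality_vec, num_steps=50, seed=0):
    """Simplified DDPM-style ancestral sampling loop (variance terms reduced)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    z = rng.standard_normal(source_latent.shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(z, t, source_latent, quality_vec)
        # Standard DDPM posterior-mean update (simplified).
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            z = z + np.sqrt(betas[t]) * rng.standard_normal(z.shape)
    return z  # modified-voice latent; a decoder/vocoder would render audio


def perceptual_rmse(target_qualities, rated_qualities):
    """RMSE between target perceptual-quality vectors and labels
    (human ratings or a predictor's outputs), as used for evaluation."""
    diff = np.asarray(target_qualities) - np.asarray(rated_qualities)
    return float(np.sqrt(np.mean(diff ** 2)))


if __name__ == "__main__":
    src = np.zeros((16, 8))               # toy source-voice latent
    target_q = np.array([0.2, 0.8, 0.5])  # e.g., three perceptual axes
    modified = modify_voice(src, target_q)
    print(modified.shape)
    # Pretend these are ratings of the modified voice on the same axes.
    print(perceptual_rmse(target_q, [0.25, 0.7, 0.55]))
```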
Related papers
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation [4.961389445237138]
Voice Conversion (VC) converts the voice of a source utterance to that of a target speaker while preserving the source's linguistic content.
We propose a VC model that can automatically disentangle speech into four components using only two augmentation functions.
arXiv Detail & Related papers (2023-06-21T13:28:06Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- Toward Degradation-Robust Voice Conversion [94.60503904292916]
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any target speaker, even one unseen during training.
It is difficult to collect clean utterances from a speaker, and recordings are usually degraded by noise or reverberation.
We report in this paper the first comprehensive study of the degradation robustness of any-to-any voice conversion.
arXiv Detail & Related papers (2021-10-14T17:00:34Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Defending Your Voice: Adversarial Attack on Voice Conversion [70.19396655909455]
We report the first known attempt to perform an adversarial attack on voice conversion.
We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.
The speaker characteristics of the converted utterances become clearly different from those of the defended speaker.
arXiv Detail & Related papers (2020-05-18T14:51:54Z)
- VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking [55.366941476863644]
The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
arXiv Detail & Related papers (2020-01-22T04:52:06Z)