Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
- URL: http://arxiv.org/abs/2109.13821v1
- Date: Tue, 28 Sep 2021 15:48:22 GMT
- Title: Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
- Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, Jiansheng Wei
- Abstract summary: The most challenging variant, often referred to as one-shot many-to-many voice conversion, consists in copying the target voice from only one reference utterance in the most general case, when neither the source nor the target speaker belongs to the training dataset.
We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches.
- Score: 4.053320933149689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice conversion is a common speech synthesis task which can be solved in
different ways depending on the particular real-world scenario. The most
challenging one, often referred to as one-shot many-to-many voice conversion,
consists in copying the target voice from only one reference utterance in the
most general case, when neither the source nor the target speaker belongs to
the training dataset. We present a scalable, high-quality solution based on
diffusion probabilistic modeling and demonstrate its superior quality compared
to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on
real-time applications, we investigate general principles which can make
diffusion models faster while keeping synthesis quality at a high level. As a
result, we develop a novel stochastic differential equation (SDE) solver
suitable for various diffusion model types and generative tasks, as shown
through empirical studies, and justify it by theoretical analysis.
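For intuition, sampling from a diffusion model amounts to numerically integrating a reverse-time SDE driven by a learned score function. The sketch below shows a generic Euler-Maruyama reverse-SDE sampler for a variance-preserving diffusion; it is not the paper's maximum-likelihood solver, and `score_fn` and the `beta` noise schedule are illustrative placeholders.

```python
import numpy as np

def reverse_sde_sample(score_fn, x_T, beta, n_steps=100, seed=0):
    """Euler-Maruyama integration of the reverse-time VP-SDE
        dx = [-0.5*beta(t)*x - beta(t)*score(x, t)] dt + sqrt(beta(t)) dW,
    stepped backwards from t=1 to t=0. A generic sketch, not the
    maximum-likelihood scheme proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    x = x_T.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        b = beta(t)
        # reverse drift: forward drift minus g^2 times the score
        drift = -0.5 * b * x - b * score_fn(x, t)
        noise = np.sqrt(b * dt) * rng.standard_normal(x.shape)
        x = x - drift * dt + noise  # step backwards in time
    return x
```

As a sanity check, if the data distribution is a standard Gaussian, the exact score is `-x` at every noise level, and integrating the reverse SDE from Gaussian noise should return (approximately) standard Gaussian samples.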
Related papers
- A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation [32.648815593259485]
Training diffusion models for audiovisual sequences allows for a range of generation tasks.
We propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space.
arXiv Detail & Related papers (2024-05-22T15:47:14Z)
- Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on the reformulation of the standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z)
- DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers [2.1408617023874443]
"DiT-Head" is based on diffusion transformers and uses audio as a condition to drive the denoising process of a diffusion model.
We train and evaluate our proposed approach and compare it against existing methods of talking head synthesis.
arXiv Detail & Related papers (2023-12-11T14:09:56Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
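The patch-based idea mentioned above can be illustrated in a few lines: split a waveform into fixed-size chunks so the model processes short patches instead of the full sequence. This is a hedged sketch of the general technique, not LinDiff's actual implementation.

```python
import numpy as np

def partition_into_patches(signal, patch_size):
    """Split a 1-D signal into fixed-size patches, zero-padding the tail.
    Illustrates the generic patch-partitioning idea; LinDiff's exact
    scheme may differ.
    """
    n = len(signal)
    pad = (-n) % patch_size          # amount needed to fill the last patch
    padded = np.pad(signal, (0, pad))
    return padded.reshape(-1, patch_size)
```

Processing P patches of length L instead of one sequence of length P*L reduces per-layer attention cost, which is the usual motivation for this kind of partitioning.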
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion Model [1.0874597293913013]
UnDiff is a diffusion probabilistic model capable of solving various speech inverse tasks.
It can be adapted to different tasks, including degradation inversion, neural vocoding, and source separation.
arXiv Detail & Related papers (2023-06-01T14:22:55Z)
- SE-Bridge: Speech Enhancement with Consistent Brownian Bridge [18.37042387650827]
We propose SE-Bridge, a novel method for speech enhancement (SE).
Our approach is based on a consistency model that ensures all speech states on the same PF-ODE trajectory correspond to the same initial state.
By integrating the Brownian Bridge process, the model is able to generate high-intelligibility speech samples without adversarial training.
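For context, a Brownian bridge is a Wiener process pinned to fixed values at both endpoints, which is what lets such models interpolate between a noisy and a clean speech state. The snippet below samples a discrete Brownian bridge; it illustrates the underlying stochastic process only, not SE-Bridge's model.

```python
import numpy as np

def brownian_bridge(x0, x1, n_steps, seed=0):
    """Sample a discrete Brownian bridge pinned at x0 (t=0) and x1 (t=1).
    A sketch of the process underlying bridge-based enhancement models.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_steps + 1)
    # standard Wiener path on [0, 1]
    dw = rng.standard_normal(n_steps) * np.sqrt(1.0 / n_steps)
    w = np.concatenate([[0.0], np.cumsum(dw)])
    bridge = w - t * w[-1]               # pin the Wiener path to 0 at t=1
    return x0 + t * (x1 - x0) + bridge   # interpolate between the endpoints
```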
arXiv Detail & Related papers (2023-05-23T08:06:36Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, which is complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.