Analysing Diffusion-based Generative Approaches versus Discriminative
Approaches for Speech Restoration
- URL: http://arxiv.org/abs/2211.02397v1
- Date: Fri, 4 Nov 2022 12:06:14 GMT
- Title: Analysing Diffusion-based Generative Approaches versus Discriminative
Approaches for Speech Restoration
- Authors: Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann
- Abstract summary: We systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks.
We observe that the generative approach performs globally better than its discriminative counterpart on all tasks.
- Score: 16.09633286837904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based generative models have had a high impact on the computer
vision and speech processing communities these past years. Besides data
generation tasks, they have also been employed for data restoration tasks like
speech enhancement and dereverberation. While discriminative models have
traditionally been argued to be more powerful e.g. for speech enhancement,
generative diffusion approaches have recently been shown to narrow this
performance gap considerably. In this paper, we systematically compare the
performance of generative diffusion models and discriminative approaches on
different speech restoration tasks. For this, we extend our prior contributions
on diffusion-based speech enhancement in the complex time-frequency domain to
the task of bandwidth extension. We then compare it to a discriminatively
trained neural network with the same network architecture on three restoration
tasks, namely speech denoising, dereverberation and bandwidth extension. We
observe that the generative approach performs globally better than its
discriminative counterpart on all tasks, with the strongest benefit for
non-additive distortion models, like in dereverberation and bandwidth
extension. Code and audio examples can be found online at
https://uhh.de/inf-sp-sgmsemultitask
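The paper's actual model (a score network in the complex time-frequency domain, shared between the generative and discriminative variants) is not reproduced here. The following is a minimal, self-contained sketch of what a reverse-diffusion enhancement sampler of this kind looks like; `score_model` is a hypothetical stand-in for a trained conditional score network, and the schedule and step rule are simplified Euler-Maruyama, not the paper's exact sampler.

```python
import numpy as np

# Hypothetical score model: in a real system this is a trained neural network
# estimating the score of the clean-speech distribution conditioned on the
# degraded observation y. Here it is a toy stand-in that pulls the state
# toward y, just to make the sampler runnable.
def score_model(x, y, sigma):
    return (y - x) / (sigma ** 2 + 1e-8)

def reverse_diffusion_enhance(y, n_steps=30, sigma_max=1.0, sigma_min=0.05, seed=0):
    """Euler-Maruyama sampler for a variance-exploding reverse SDE.

    y: complex STFT of the degraded speech (any complex numpy array).
    Returns an estimate of the clean STFT after n_steps reverse steps.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)
    # Initialise from the degraded signal plus noise at the largest scale.
    noise0 = (rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)) / np.sqrt(2)
    x = y + sigma_max * noise0
    for i in range(n_steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        dt = sigma_next ** 2 - sigma ** 2           # negative: noise level shrinks
        x = x - dt * score_model(x, y, sigma)       # deterministic drift step
        noise = (rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)) / np.sqrt(2)
        x = x + np.sqrt(max(-dt, 0.0)) * noise      # stochastic diffusion term
    return x
```

A discriminative counterpart with the same architecture would instead map `y` to the clean estimate in a single forward pass, which is the comparison the paper makes across denoising, dereverberation, and bandwidth extension.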
Related papers
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - Investigating the Design Space of Diffusion Models for Speech Enhancement [17.914763947871368]
Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature.
We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals.
We also show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system.
arXiv Detail & Related papers (2023-12-07T15:40:55Z) - Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
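The posterior-sampling idea summarized above can be sketched in score form: by Bayes' rule, the posterior score splits into a learned clean-speech prior score plus a likelihood gradient from an explicit noise model. The sketch below assumes a simple Gaussian noise model y = x + n; `prior_score` is a toy zero-mean Gaussian stand-in for the diffusion prior, and all names are illustrative rather than the paper's.

```python
import numpy as np

# Toy stand-in for a diffusion prior trained on clean speech only:
# score of a zero-mean Gaussian after adding diffusion noise of scale sigma.
def prior_score(x, sigma):
    return -x / (1.0 + sigma ** 2)

def posterior_score(x, y, sigma, noise_var=0.1):
    # Bayes' rule in score form:
    #   grad log p(x|y) = grad log p(x) + grad log p(y|x)
    likelihood_grad = (y - x) / noise_var
    return prior_score(x, sigma) + likelihood_grad

def annealed_langevin_enhance(y, n_steps=50, step=0.01, seed=0):
    """Annealed Langevin dynamics driven by the posterior score."""
    rng = np.random.default_rng(seed)
    x = np.copy(y)
    sigmas = np.geomspace(1.0, 0.01, n_steps)
    for sigma in sigmas:
        g = posterior_score(x, y, sigma)
        x = x + step * g + np.sqrt(2 * step) * rng.standard_normal(y.shape)
    return x
```

The key property is that the prior needs only clean speech for training; the noise model enters solely at inference time, which is what makes the approach unsupervised with respect to paired noisy/clean data.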
arXiv Detail & Related papers (2023-09-19T09:11:31Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion
Model [1.0874597293913013]
UnDiff is a diffusion probabilistic model capable of solving various speech inverse tasks.
It can be adapted to different tasks including degradation inversion, neural vocoding, and source separation.
arXiv Detail & Related papers (2023-06-01T14:22:55Z) - A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features in languages.
Specifically, we design a linguistic-informed forward process which adds corruptions to the text through strategically soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z) - A Survey on Audio Diffusion Models: Text To Speech Synthesis and
Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z) - DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z) - Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative
Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a stochastic differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.