Pre-training Feature Guided Diffusion Model for Speech Enhancement
- URL: http://arxiv.org/abs/2406.07646v1
- Date: Tue, 11 Jun 2024 18:22:59 GMT
- Title: Pre-training Feature Guided Diffusion Model for Speech Enhancement
- Authors: Yiyuan Yang, Niki Trigoni, Andrew Markham
- Abstract summary: Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments.
We introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments, enhancing communication and listening experiences. In this paper, we introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE) and leveraging pre-trained features for guidance during the reverse process, coupled with the use of denoising diffusion implicit models (DDIM) to streamline sampling steps, our model improves efficiency and speech enhancement quality. Our model demonstrates state-of-the-art results on two public datasets with different SNRs, outperforming other baselines in efficiency and robustness. The proposed method not only optimizes performance but also enhances practical deployment capabilities, without increasing computational demands.
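As a concrete illustration of the sampling pipeline described in the abstract, below is a minimal sketch of DDIM-style deterministic sampling with an additive term standing in for the pre-trained feature guidance. This is a sketch under assumptions, not the authors' implementation: the denoiser, the guidance function, the noise schedule, the 25-step budget, and the latent shape are all illustrative.

```python
# Minimal sketch of guided DDIM sampling (illustrative; not the paper's code).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def denoiser(x_t, t):
    """Stand-in for the trained noise-prediction network epsilon_theta."""
    return 0.1 * x_t

def feature_guidance(x_t, t, scale=0.5):
    """Stand-in for a gradient pulling x_t toward pre-trained feature targets."""
    return -1e-3 * scale * x_t

def ddim_sample(x_T, num_steps=25):
    # DDIM skips most of the T training steps; eta = 0 makes each update deterministic.
    ts = np.linspace(T - 1, 0, num_steps).astype(int)
    x = x_T
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        eps = denoiser(x, t) + feature_guidance(x, t)            # guided noise estimate
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

rng = np.random.default_rng(0)
enhanced_latent = ddim_sample(rng.standard_normal(256))  # assumed VAE-latent size
print(enhanced_latent.shape)
```

Fewer sampling steps is what DDIM buys here: the deterministic update traverses the long training schedule in a few dozen strides without resampling noise at each step.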
Related papers
- Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model
Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performance across a diverse range of generative tasks.
We ask: is it possible to improve the training/inference speed and performance of DDPMs by modifying the speech signal itself?
In this paper, we double the training and inference speed of speech DDPMs simply by redirecting the generative target to the wavelet domain (a minimal sketch of this transform follows the list).
arXiv Detail & Related papers (2024-02-16T12:43:01Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR (a toy encoder-to-LLM bridge is sketched after this list).
The proposed model enables optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
arXiv Detail & Related papers (2023-12-06T18:34:42Z)
- Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z)
- VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
We introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech.
We propose improved structures and training mechanisms and show that they are effective in improving naturalness.
We demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method.
arXiv Detail & Related papers (2023-07-31T06:36:44Z)
- Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement
We present a framework that encapsulates both variance-preserving (VP) and variance-exploding (VE) diffusion methods (the two perturbation kernels are contrasted in a sketch after this list).
To improve performance and ease model training, we analyze the common difficulties encountered in diffusion models.
We evaluate our model against several methods on a public benchmark to showcase the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-14T14:22:22Z)
- A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning
We present a canonical-correlation-based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model (a toy canonical-correlation similarity is sketched after this list).
We show that our CC-STOI-based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes (see the conditional forward-process sketch after this list).
In our experiments, we demonstrate the strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
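For the "Speaking in Wavelet Domain" entry above, here is a minimal sketch of the mechanism it describes: a single-level discrete wavelet transform splits the waveform into two half-length coefficient bands, so a DDPM whose generative target lives in the wavelet domain works on sequences roughly twice as short. The wavelet family ("db4"), the random stand-in signal, and the shapes are assumptions; PyWavelets provides the transform.

```python
# Sketch: redirect a speech generation target to the wavelet domain.
# Requires: pip install pywavelets numpy
import numpy as np
import pywt

sr = 16000
speech = np.random.default_rng(0).standard_normal(sr)  # stand-in 1 s waveform

# Single-level DWT yields approximation and detail bands, each about half
# the original length, which is where the ~2x speed-up comes from.
cA, cD = pywt.dwt(speech, "db4")
print(speech.shape, cA.shape, cD.shape)

# Once the model has generated wavelet coefficients, the waveform is
# recovered exactly by the inverse transform.
recon = pywt.idwt(cA, cD, "db4")
print(np.allclose(recon[: len(speech)], speech))
```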
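For the "Integrating Pre-Trained Speech and Language Models" entry, a toy sketch of the bridging pattern it describes: frame-level features from a pre-trained speech encoder are projected into the LLM's embedding space and consumed alongside text embeddings, so the whole pipeline can be optimized end to end. The dimensions, hop size, and stand-in encoder are assumptions.

```python
# Toy bridge from a pre-trained speech encoder to an LLM (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d_speech, d_llm = 768, 4096            # assumed encoder / LLM hidden widths

def speech_encoder(waveform):
    """Stand-in for a pre-trained speech representation model."""
    n_frames = len(waveform) // 320    # assumed 20 ms hop at 16 kHz
    return rng.standard_normal((n_frames, d_speech))

W_proj = 0.02 * rng.standard_normal((d_speech, d_llm))  # trainable projection

waveform = rng.standard_normal(16000)                   # 1 s of audio
speech_embeds = speech_encoder(waveform) @ W_proj       # map frames into LLM space
prompt_embeds = rng.standard_normal((8, d_llm))         # stand-in text prompt embeddings

# Speech and text tokens form one sequence for the LLM, so gradients can
# flow from the ASR loss back through the projection and encoder.
llm_input = np.concatenate([prompt_embeds, speech_embeds], axis=0)
print(llm_input.shape)
```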
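For the "Variance-Preserving-Based Interpolation Diffusion Models" entry, a toy contrast of the two perturbation kernels the framework unifies. The schedules and the random stand-in signal are assumptions.

```python
# Toy contrast of variance-preserving (VP) vs variance-exploding (VE)
# forward perturbations (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)   # stand-in clean signal (unit variance)
z = rng.standard_normal(16000)    # Gaussian noise

def vp_perturb(x0, z, alpha_bar):
    # VP: the signal is scaled down as noise is added, keeping the
    # marginal variance near 1 for a unit-variance input.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * z

def ve_perturb(x0, z, sigma):
    # VE: the signal keeps its scale while the noise std grows, so the
    # marginal variance explodes with sigma.
    return x0 + sigma * z

print(np.var(vp_perturb(x0, z, 0.5)))   # stays close to 1
print(np.var(ve_perturb(x0, z, 10.0)))  # roughly 1 + sigma**2
```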
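For the "Canonical Correlation and Deep Learning" entry, a toy canonical-correlation similarity between short-time segments of clean and enhanced features, negated so it can act as a training loss. The segment shapes, random features, and use of scikit-learn's CCA are assumptions; the actual CC-STOI cost follows the STOI band and segment design.

```python
# Toy CC-STOI-style score: canonical correlation between clean and
# enhanced short-time feature segments (illustrative).
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
clean = rng.standard_normal((30, 15))                   # 30 frames x 15 bands (assumed)
enhanced = clean + 0.3 * rng.standard_normal((30, 15))  # imperfect estimate

cca = CCA(n_components=1)
u, v = cca.fit_transform(clean, enhanced)   # 1-D canonical projections of both segments
cc = np.corrcoef(u[:, 0], v[:, 0])[0, 1]    # canonical correlation
loss = -cc                                  # maximize correlation = minimize loss
print(cc, loss)
```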
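For the "Conditional Diffusion Probabilistic Model" entry, a sketch of one way to fold the observed noisy signal into the forward process: drift the diffusion mean from the clean signal toward the noisy observation as t grows. The interpolation weight m_t and the shared variance term are simplifying assumptions, not the paper's exact formulation.

```python
# Sketch of a conditional forward process whose mean interpolates from
# clean speech x0 toward the noisy observation y (illustrative).
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def conditional_q_sample(x0, y, t, rng):
    """Sample x_t with a mean that moves from x0 (t=0) toward y (t=T)."""
    a_bar = alphas_bar[t]
    m_t = np.sqrt(1.0 - a_bar)        # assumed weight: 0 at t=0, -> 1 as t -> T
    mean = np.sqrt(a_bar) * ((1.0 - m_t) * x0 + m_t * y)
    noise = rng.standard_normal(x0.shape)
    return mean + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)             # clean waveform stand-in
y = x0 + 0.5 * rng.standard_normal(16000)   # observed noisy waveform
x_mid = conditional_q_sample(x0, y, t=100, rng=rng)
print(x_mid.shape)
```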