Pre-training Feature Guided Diffusion Model for Speech Enhancement
- URL: http://arxiv.org/abs/2406.07646v1
- Date: Tue, 11 Jun 2024 18:22:59 GMT
- Title: Pre-training Feature Guided Diffusion Model for Speech Enhancement
- Authors: Yiyuan Yang, Niki Trigoni, Andrew Markham
- Abstract summary: Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments.
We introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments, enhancing communication and listening experiences. In this paper, we introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE) and leveraging pre-trained features for guidance during the reverse process, coupled with the use of denoising diffusion implicit models (DDIM) to streamline sampling steps, our model improves efficiency and speech enhancement quality. Our model demonstrates state-of-the-art results on two public datasets with different SNRs, outperforming other baselines in efficiency and robustness. The proposed method not only optimizes performance but also enhances practical deployment capabilities, without increasing computational demands.
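As a concrete illustration of the sampling pipeline described in the abstract, below is a minimal sketch of DDIM-style deterministic sampling with an additive term standing in for the pre-trained feature guidance. This is a sketch under assumptions, not the authors' implementation: the denoiser, the guidance function, the noise schedule, the 25-step budget, and the latent shape are all illustrative.

```python
# Minimal sketch of guided DDIM sampling (illustrative; not the paper's code).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def denoiser(x_t, t):
    """Stand-in for the trained noise-prediction network epsilon_theta."""
    return 0.1 * x_t

def feature_guidance(x_t, t, scale=0.5):
    """Stand-in for a gradient pulling x_t toward pre-trained feature targets."""
    return -1e-3 * scale * x_t

def ddim_sample(x_T, num_steps=25):
    # DDIM skips most of the T training steps; eta = 0 makes each update deterministic.
    ts = np.linspace(T - 1, 0, num_steps).astype(int)
    x = x_T
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        eps = denoiser(x, t) + feature_guidance(x, t)            # guided noise estimate
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

rng = np.random.default_rng(0)
enhanced_latent = ddim_sample(rng.standard_normal(256))  # assumed VAE-latent size
print(enhanced_latent.shape)
```

Fewer sampling steps is what DDIM buys here: the deterministic update traverses the long training schedule in a few dozen strides without resampling noise at each step.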
Related papers
- Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model
Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performance across a diverse range of generative tasks.
We ask: is it possible to improve the training/inference speed and performance of DDPMs by modifying the speech signal itself?
In this paper, we double the training and inference speed of speech DDPMs simply by redirecting the generative target to the wavelet domain (a minimal sketch of this transform follows the list).
arXiv Detail & Related papers (2024-02-16T12:43:01Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR (a toy encoder-to-LLM bridge is sketched after this list).
The proposed model enables optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
arXiv Detail & Related papers (2023-12-06T18:34:42Z)
- Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z)
- VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
We introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech.
We propose improved structures and training mechanisms and show that they are effective in improving naturalness.
We demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method.
arXiv Detail & Related papers (2023-07-31T06:36:44Z)
- Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement
We present a framework that encapsulates both variance-preserving (VP) and variance-exploding (VE) diffusion methods (the two perturbation kernels are contrasted in a sketch after this list).
To improve performance and ease model training, we analyze the common difficulties encountered in diffusion models.
We evaluate our model against several methods on a public benchmark to showcase the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-14T14:22:22Z)
- A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning
We present a canonical-correlation-based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model (a toy canonical-correlation similarity is sketched after this list).
We show that our CC-STOI-based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes (see the conditional forward-process sketch after this list).
In our experiments, we demonstrate the strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
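For the "Speaking in Wavelet Domain" entry above, here is a minimal sketch of the mechanism it describes: a single-level discrete wavelet transform splits the waveform into two half-length coefficient bands, so a DDPM whose generative target lives in the wavelet domain works on sequences roughly twice as short. The wavelet family ("db4"), the random stand-in signal, and the shapes are assumptions; PyWavelets provides the transform.

```python
# Sketch: redirect a speech generation target to the wavelet domain.
# Requires: pip install pywavelets numpy
import numpy as np
import pywt

sr = 16000
speech = np.random.default_rng(0).standard_normal(sr)  # stand-in 1 s waveform

# Single-level DWT yields approximation and detail bands, each about half
# the original length, which is where the ~2x speed-up comes from.
cA, cD = pywt.dwt(speech, "db4")
print(speech.shape, cA.shape, cD.shape)

# Once the model has generated wavelet coefficients, the waveform is
# recovered exactly by the inverse transform.
recon = pywt.idwt(cA, cD, "db4")
print(np.allclose(recon[: len(speech)], speech))
```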
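For the "Integrating Pre-Trained Speech and Language Models" entry, a toy sketch of the bridging pattern it describes: frame-level features from a pre-trained speech encoder are projected into the LLM's embedding space and consumed alongside text embeddings, so the whole pipeline can be optimized end to end. The dimensions, hop size, and stand-in encoder are assumptions.

```python
# Toy bridge from a pre-trained speech encoder to an LLM (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d_speech, d_llm = 768, 4096            # assumed encoder / LLM hidden widths

def speech_encoder(waveform):
    """Stand-in for a pre-trained speech representation model."""
    n_frames = len(waveform) // 320    # assumed 20 ms hop at 16 kHz
    return rng.standard_normal((n_frames, d_speech))

W_proj = 0.02 * rng.standard_normal((d_speech, d_llm))  # trainable projection

waveform = rng.standard_normal(16000)                   # 1 s of audio
speech_embeds = speech_encoder(waveform) @ W_proj       # map frames into LLM space
prompt_embeds = rng.standard_normal((8, d_llm))         # stand-in text prompt embeddings

# Speech and text tokens form one sequence for the LLM, so gradients can
# flow from the ASR loss back through the projection and encoder.
llm_input = np.concatenate([prompt_embeds, speech_embeds], axis=0)
print(llm_input.shape)
```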
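For the "Variance-Preserving-Based Interpolation Diffusion Models" entry, a toy contrast of the two perturbation kernels the framework unifies. The schedules and the random stand-in signal are assumptions.

```python
# Toy contrast of variance-preserving (VP) vs variance-exploding (VE)
# forward perturbations (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)   # stand-in clean signal (unit variance)
z = rng.standard_normal(16000)    # Gaussian noise

def vp_perturb(x0, z, alpha_bar):
    # VP: the signal is scaled down as noise is added, keeping the
    # marginal variance near 1 for a unit-variance input.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * z

def ve_perturb(x0, z, sigma):
    # VE: the signal keeps its scale while the noise std grows, so the
    # marginal variance explodes with sigma.
    return x0 + sigma * z

print(np.var(vp_perturb(x0, z, 0.5)))   # stays close to 1
print(np.var(ve_perturb(x0, z, 10.0)))  # roughly 1 + sigma**2
```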
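For the "Canonical Correlation and Deep Learning" entry, a toy canonical-correlation similarity between short-time segments of clean and enhanced features, negated so it can act as a training loss. The segment shapes, random features, and use of scikit-learn's CCA are assumptions; the actual CC-STOI cost follows the STOI band and segment design.

```python
# Toy CC-STOI-style score: canonical correlation between clean and
# enhanced short-time feature segments (illustrative).
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
clean = rng.standard_normal((30, 15))                   # 30 frames x 15 bands (assumed)
enhanced = clean + 0.3 * rng.standard_normal((30, 15))  # imperfect estimate

cca = CCA(n_components=1)
u, v = cca.fit_transform(clean, enhanced)   # 1-D canonical projections of both segments
cc = np.corrcoef(u[:, 0], v[:, 0])[0, 1]    # canonical correlation
loss = -cc                                  # maximize correlation = minimize loss
print(cc, loss)
```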
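For the "Conditional Diffusion Probabilistic Model" entry, a sketch of one way to fold the observed noisy signal into the forward process: drift the diffusion mean from the clean signal toward the noisy observation as t grows. The interpolation weight m_t and the shared variance term are simplifying assumptions, not the paper's exact formulation.

```python
# Sketch of a conditional forward process whose mean interpolates from
# clean speech x0 toward the noisy observation y (illustrative).
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def conditional_q_sample(x0, y, t, rng):
    """Sample x_t with a mean that moves from x0 (t=0) toward y (t=T)."""
    a_bar = alphas_bar[t]
    m_t = np.sqrt(1.0 - a_bar)        # assumed weight: 0 at t=0, -> 1 as t -> T
    mean = np.sqrt(a_bar) * ((1.0 - m_t) * x0 + m_t * y)
    noise = rng.standard_normal(x0.shape)
    return mean + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)             # clean waveform stand-in
y = x0 + 0.5 * rng.standard_normal(16000)   # observed noisy waveform
x_mid = conditional_q_sample(x0, y, t=100, rng=rng)
print(x_mid.shape)
```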