SE-Bridge: Speech Enhancement with Consistent Brownian Bridge
- URL: http://arxiv.org/abs/2305.13796v1
- Date: Tue, 23 May 2023 08:06:36 GMT
- Title: SE-Bridge: Speech Enhancement with Consistent Brownian Bridge
- Authors: Zhibin Qiu, Mengfan Fu, Fuchun Sun, Gulila Altenbek, Hao Huang
- Abstract summary: We propose SE-Bridge, a novel method for speech enhancement (SE).
Our approach is based on a consistency model, which ensures that all speech states on the same PF-ODE trajectory correspond to the same initial state.
By integrating the Brownian Bridge process, the model generates high-intelligibility speech samples without adversarial training.
- Score: 18.37042387650827
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose SE-Bridge, a novel method for speech enhancement (SE). With
the recent application of diffusion models to speech enhancement, SE can be
achieved by solving a stochastic differential equation (SDE). Each SDE
corresponds to a probability flow ordinary differential equation (PF-ODE), and
the trajectory of the PF-ODE solution consists of the speech states at
different moments. Our approach is based on a consistency model, which ensures
that all speech states on the same PF-ODE trajectory correspond to the same
initial state. By integrating the Brownian Bridge process, the model generates
high-intelligibility speech samples without adversarial training. This is the
first attempt to apply consistency models to the SE task; it achieves
state-of-the-art results on several metrics while requiring 15x less sampling
time than the diffusion-based baseline. Our experiments on multiple datasets
demonstrate the effectiveness of SE-Bridge in SE. Furthermore, extensive
experiments on downstream tasks, including Automatic Speech Recognition (ASR)
and Speaker Verification (SV), show that SE-Bridge can effectively support
multiple downstream tasks.
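The Brownian Bridge process underlying SE-Bridge is a diffusion pinned at both endpoints, so the trajectory is guaranteed to terminate at a prescribed state rather than at pure noise. A minimal scalar sketch of simulating such a bridge with Euler-Maruyama (this is an illustrative toy, not the authors' implementation; the scalar setup, noise scale `sigma`, and step count are assumptions):

```python
import numpy as np

def brownian_bridge(x0, xT, T=1.0, n_steps=1000, sigma=1.0, seed=0):
    """Euler-Maruyama simulation of a Brownian bridge pinned at
    x(0) = x0 and x(T) = xT:  dx = (xT - x) / (T - t) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        drift = (xT - x[k]) / (T - t)  # pulls the path toward xT as t -> T
        x[k + 1] = x[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    x[-1] = xT  # the bridge is pinned exactly at the terminal endpoint
    return x

path = brownian_bridge(x0=0.0, xT=2.0)
print(path[0], path[-1])  # both endpoints are fixed by construction
```

In the SE setting, the two endpoints would correspond to the noisy and clean speech states; here they are arbitrary scalars chosen for illustration.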
Related papers
- Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation [58.19676004192321]
Diffusion models (DMs), which enable both image generation from noise and inversion from data, have inspired powerful unpaired image-to-image (I2I) translation algorithms.
We tackle this problem with Schrodinger Bridges (SBs), which are stochastic differential equations (SDEs) between distributions with minimal transport cost.
Inspired by this observation, we propose Latent Schrodinger Bridges (LSBs) that approximate the SB ODE via pre-trained Stable Diffusion.
We demonstrate that our algorithm successfully conducts competitive I2I translation in an unsupervised setting with only a fraction of the cost required by previous DM-based methods.
arXiv Detail & Related papers (2024-11-22T11:24:14Z) - Consistency Diffusion Bridge Models [25.213664260896103]
Diffusion bridge models (DDBMs) build processes between fixed data endpoints based on a reference diffusion process.
DDBMs' sampling process typically requires hundreds of network evaluations to achieve decent performance.
We propose two paradigms, consistency bridge distillation and consistency bridge training, which are flexible to apply to DDBMs with broad design choices.
arXiv Detail & Related papers (2024-10-30T02:04:23Z) - Diffusion Bridge Implicit Models [25.213664260896103]
Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions.
We take the first step in fast sampling of DDBMs without extra training, motivated by the well-established recipes in diffusion models.
We induce a novel, simple, and insightful form of ordinary differential equation (ODE) which inspires high-order numerical solvers.
arXiv Detail & Related papers (2024-05-24T19:08:30Z) - Fast Ensembling with Diffusion Schrödinger Bridge [17.334437293164566]
The Deep Ensemble (DE) approach is a straightforward technique for enhancing the performance of deep neural networks by training them from different initial points so that they converge towards various local optima.
We propose a novel approach called Diffusion Bridge Network (DBN) to address the computational cost of such ensembles.
By substituting the heavy ensembles with this lightweight neural network DBN, we achieved inference with reduced computational cost while maintaining accuracy and uncertainty scores on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet.
arXiv Detail & Related papers (2024-04-24T11:35:02Z) - ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge [63.00793292863]
ToddlerDiffusion is a novel approach to decomposing the complex task of RGB image generation into simpler, interpretable stages.
Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation.
ToddlerDiffusion consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-24T15:20:01Z) - Gaussian Mixture Solvers for Diffusion Models [84.83349474361204]
We introduce a novel class of SDE-based solvers called Gaussian Mixture Solvers (GMS) for diffusion models.
Our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis.
arXiv Detail & Related papers (2023-11-02T02:05:38Z) - Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement [53.2171981279647]
We present a framework that encapsulates both the VP- and variance-exploding (VE)-based diffusion methods.
To improve performance and ease model training, we analyze the common difficulties encountered in diffusion models.
We evaluate our model against several methods using a public benchmark to showcase the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-14T14:22:22Z) - Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
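Patch-based processing of this kind amounts to reshaping a long 1-D signal into fixed-length segments that can be processed independently or in parallel. A minimal sketch (the patch length, non-overlapping layout, and zero-padding scheme are illustrative assumptions, not LinDiff's actual configuration):

```python
import numpy as np

def to_patches(signal, patch_len):
    """Split a 1-D signal into non-overlapping patches of length patch_len,
    zero-padding the tail so the signal length divides evenly."""
    pad = (-len(signal)) % patch_len  # 0 when len(signal) is already a multiple
    padded = np.pad(signal, (0, pad))
    return padded.reshape(-1, patch_len)

x = np.arange(10, dtype=float)       # 10 samples, patch length 4
patches = to_patches(x, 4)
print(patches.shape)                 # (3, 4): last patch is zero-padded
```

The inverse operation is simply flattening the patches and trimming the padded tail, which is what makes this decomposition attractive for reducing per-step computational cost.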
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme [4.053320933149689]
The most challenging setting, often referred to as one-shot many-to-many voice conversion, consists of copying the target voice from only one reference utterance in the most general case, when neither the source nor the target speaker belongs to the training dataset.
We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches.
arXiv Detail & Related papers (2021-09-28T15:48:22Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.