High-Resolution Speech Restoration with Latent Diffusion Model
- URL: http://arxiv.org/abs/2409.11145v1
- Date: Tue, 17 Sep 2024 12:55:23 GMT
- Title: High-Resolution Speech Restoration with Latent Diffusion Model
- Authors: Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu,
- Abstract summary: Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics.
We propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality.
We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details.
- Score: 24.407232363131534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-instrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
Related papers
- MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation [29.620451579580763]
We propose a novel motion-disentangled diffusion model for talking head generation, dubbed MoDiTalker.
We introduce the two modules: audio-to-motion (AToM), designed to generate a synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion.
Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models.
arXiv Detail & Related papers (2024-03-28T04:35:42Z) - Text Diffusion with Reinforced Conditioning [92.17397504834825]
This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling.
Motivated by our findings, we propose a novel Text Diffusion model called TREC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling.
arXiv Detail & Related papers (2024-02-19T09:24:02Z) - SMRD: SURE-based Robust MRI Reconstruction with Diffusion Models [76.43625653814911]
Diffusion models have gained popularity for accelerated MRI reconstruction due to their high sample quality.
They can effectively serve as rich data priors while incorporating the forward model flexibly at inference time.
We introduce SURE-based MRI Reconstruction with Diffusion models (SMRD) to enhance robustness during testing.
arXiv Detail & Related papers (2023-10-03T05:05:35Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation [21.335983674309475]
Diffusion models suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation.
We introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query.
We achieve so by proposing "CFG-aware latent consistency model," which adapts consistency generation into a latent space.
arXiv Detail & Related papers (2023-09-19T16:36:33Z) - Reconstruct-and-Generate Diffusion Model for Detail-Preserving Image
Denoising [16.43285056788183]
We propose a novel approach called the Reconstruct-and-Generate Diffusion Model (RnG)
Our method leverages a reconstructive denoising network to recover the majority of the underlying clean signal.
It employs a diffusion algorithm to generate residual high-frequency details, thereby enhancing visual quality.
arXiv Detail & Related papers (2023-09-19T16:01:20Z) - DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly
Detection [89.49600182243306]
We reformulate the reconstruction process using a diffusion model into a noise-to-norm paradigm.
We propose a rapid one-step denoising paradigm, significantly faster than the traditional iterative denoising in diffusion models.
The segmentation sub-network predicts pixel-level anomaly scores using the input image and its anomaly-free restoration.
arXiv Detail & Related papers (2023-03-15T16:14:06Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Low-Complexity Models for Acoustic Scene Classification Based on
Receptive Field Regularization and Frequency Damping [7.0349768355860895]
We investigate and compare several well-known methods to reduce the number of parameters in neural networks.
We show that we can achieve high-performing low-complexity models by applying specific restrictions on the Receptive Field.
We propose a filter-damping technique for regularizing the RF of models, without altering their architecture.
arXiv Detail & Related papers (2020-11-05T16:34:11Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.