Related papers: CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

URL: http://arxiv.org/abs/2209.11112v3
Date: Fri, 3 May 2024 21:38:45 GMT
Title: CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement
Authors: Sherif Abdulatif, Ruizhe Cao, Bin Yang
Abstract summary: We further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution.
Score: 5.766499647507758
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. This paper builds on our previous work but takes a more in-depth look by conducting extensive ablation studies on model inputs and architectural design choices. We rigorously tested the generalization ability of the model to unseen noise types and distortions. We have fortified our claims through DNS-MOS measurements and listening tests. Rather than focusing exclusively on the speech denoising task, we extend this work to address the dereverberation and super-resolution tasks. This necessitated exploring various architectural changes, specifically metric discriminator scores and masking techniques. It is essential to highlight that this is among the earliest works that attempted complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task using the Voice Bank+DEMAND dataset, CMGAN notably exceeded the performance of prior models, attaining a PESQ score of 3.41 and an SSNR of 11.10 dB. Audio samples and CMGAN implementations are available online.

Related papers

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection [2.432576583937997]
Spoof-SUPERB is a benchmark for audio deepfake detection. We evaluate 20 SSL models spanning generative, discriminative, and spectrogram-based architectures.
arXiv Detail & Related papers (2026-03-02T05:45:55Z)
High-Fidelity Speech Enhancement via Discrete Audio Tokens [35.61634772862795]
DAC-SE1 is a language model-based SE framework leveraging discrete high-resolution audio representations. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation.
arXiv Detail & Related papers (2025-10-02T16:38:05Z)
Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models [57.49136894315871]
The new paradigm of test-time scaling has yielded remarkable breakthroughs in reasoning models and generative vision models. We propose one solution to the problem of integrating test-time scaling knowledge into a model during post-training. We replace reward-guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise.
arXiv Detail & Related papers (2025-08-13T17:33:37Z)
DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis [4.834986020597738]
We propose Deformable Periodic Network based GAN (DPN-GAN). DPN-GAN incorporates a kernel-based periodic ReLU activation function to induce periodic bias in audio generation. We trained two versions of the model: DPN-GAN small (38.67M parameters) and DPN-GAN large (124M parameters).
arXiv Detail & Related papers (2025-05-14T02:52:16Z)
Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance [42.90024643696503]
We present an end-to-end learning solution to jointly optimise the models for audio enhancement. We consider four representative applications to evaluate our training paradigm.
arXiv Detail & Related papers (2024-08-12T16:23:58Z)
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder. Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach. It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps. We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
Unsupervised speech enhancement with deep dynamical generative speech and noise models [26.051535142743166]
This work builds on a previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both.
arXiv Detail & Related papers (2023-06-13T14:52:35Z)
An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA). Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models. To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain [18.090665052145653]
We propose a novel training task for speech enhancement using a complex-valued deep neural network. We derive this training task within the formalism of differential equations, thereby enabling the use of predictor-corrector samplers.
arXiv Detail & Related papers (2022-03-31T12:53:47Z)
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language. We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN). TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem. In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
