A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments
- URL: http://arxiv.org/abs/2506.15000v1
- Date: Tue, 17 Jun 2025 22:12:40 GMT
- Title: A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments
- Authors: Md Jahangir Alam Khondkar, Ajan Ahmed, Masudul Haider Imtiaz, Stephanie Schuckers
- Abstract summary: This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on three diverse datasets: SpEAR, VPQAD, and Clarkson. The evaluation reveals that U-Net achieves high noise suppression, with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on Clarkson. CMGAN leads in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well suited for applications that prioritize natural and intelligible speech.
- Score: 1.0499611180329804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech enhancement, particularly denoising, is vital for improving the intelligibility and quality of speech signals in real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on three diverse datasets: SpEAR, VPQAD, and Clarkson. These models were chosen for their relevance in the literature and the accessibility of their code. The evaluation reveals that U-Net achieves high noise suppression, with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN leads in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well suited for applications that prioritize natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research shows how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
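As a concrete illustration of the metrics above, the sketch below computes an SNR improvement and a PESQ score for a noisy/enhanced pair. This is not the authors' code: the paper does not specify how its percentage gains are normalized, so the relative-change formula, the placeholder file names, and the use of the open-source `pesq` and `soundfile` packages are all assumptions.

```python
# Illustrative sketch only -- not the authors' pipeline. It shows how the
# SNR-improvement percentages and PESQ scores reported above might be
# computed. Assumes 16 kHz mono WAV files (placeholder names) and the
# third-party `pesq` and `soundfile` packages (pip install pesq soundfile).
import numpy as np
import soundfile as sf
from pesq import pesq  # ITU-T P.862 (PESQ) implementation

def snr_db(clean: np.ndarray, estimate: np.ndarray) -> float:
    """SNR in dB of `estimate` measured against the clean reference."""
    noise = estimate - clean
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(noise**2))

# Placeholder file names -- substitute a real (clean, noisy, enhanced) triple.
clean, fs = sf.read("clean.wav")
noisy, _ = sf.read("noisy.wav")
enhanced, _ = sf.read("enhanced.wav")  # output of Wave-U-Net / CMGAN / U-Net

snr_before = snr_db(clean, noisy)
snr_after = snr_db(clean, enhanced)
# One plausible reading of the paper's relative gains (e.g., +71.96% on SpEAR):
# percentage change in SNR from the noisy input to the enhanced output.
print(f"SNR improvement: {100.0 * (snr_after - snr_before) / abs(snr_before):+.2f}%")

# PESQ in wideband mode for 16 kHz audio; scores range from about 1.0 to 4.5,
# so CMGAN's 4.04 on SpEAR is near the ceiling while 1.46 on VPQAD is low.
print(f"PESQ: {pesq(fs, clean, enhanced, 'wb'):.2f}")
```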
Related papers
- Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations [39.31175048498422]
This study proposes the Voice Quality Assessment Network (VOQANet), a deep learning-based framework with an attention mechanism that extracts high-level acoustic and prosodic information from raw speech.
To enhance interpretability, we also introduce VOQANet+, which integrates low-level speech descriptors such as jitter, shimmer, and harmonics-to-noise ratio (HNR) with speech foundation model (SFM) embeddings into a hybrid representation.
Results show that sentence-based input outperforms vowel-based input, especially at the patient level, underscoring the value of longer utterances for capturing voice attributes.
arXiv Detail & Related papers (2025-05-27T15:48:17Z)
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining [21.26555178371168]
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target speaker in an audio frame.
Deep neural network-based models have shown good performance in this task.
We propose a causal, Self-Supervised Learning (SSL) pretraining framework to enhance TS-VAD performance in noisy conditions.
arXiv Detail & Related papers (2025-01-06T18:00:14Z)
- HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids [30.305000305766193]
This paper introduces HAAQI-Net, a non-intrusive deep learning-based music audio quality assessment model for hearing aid users.
It can predict HAAQI scores directly from music audio clips and hearing loss patterns.
arXiv Detail & Related papers (2024-01-02T10:55:01Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification [25.082714256583422]
In noisy environments, speech can be hard to understand for humans.
We create a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing.
We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB.
arXiv Detail & Related papers (2022-10-19T02:20:17Z)
- MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment [12.144133923535714]
This paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric.
It can predict room acoustics parameters alongside the overall mean opinion score (MOS) for speech quality.
We also show that this joint training method enhances the blind estimation of room acoustics.
arXiv Detail & Related papers (2022-04-04T09:38:15Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net-based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: (a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and (b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)