Assessing the Generalization Gap of Learning-Based Speech Enhancement
Systems in Noisy and Reverberant Environments
- URL: http://arxiv.org/abs/2309.06183v2
- Date: Wed, 8 Nov 2023 08:09:37 GMT
- Title: Assessing the Generalization Gap of Learning-Based Speech Enhancement
Systems in Noisy and Reverberant Environments
- Authors: Philippe Gonzalez, Tommy Sonne Alstrøm, Tobias May
- Abstract summary: Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or room impulse response database.
The present study introduces a generalization assessment framework that uses a reference model trained on the test condition.
- The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER.
- Score: 0.7366405857677227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The acoustic variability of noisy and reverberant speech mixtures is
influenced by multiple factors, such as the spectro-temporal characteristics of
the target speaker and the interfering noise, the signal-to-noise ratio (SNR)
and the room characteristics. This large variability poses a major challenge
for learning-based speech enhancement systems, since a mismatch between the
training and testing conditions can substantially reduce the performance of the
system. Generalization to unseen conditions is typically assessed by testing
the system with a new speech, noise or binaural room impulse response (BRIR)
database different from the one used during training. However, the difficulty
of the speech enhancement task can change across databases, which can
substantially influence the results. The present study introduces a
generalization assessment framework that uses a reference model trained on the
test condition, such that it can be used as a proxy for the difficulty of the
test condition. This makes it possible to disentangle the effect of the change in task
difficulty from the effect of dealing with new data, and thus to define a new
measure of generalization performance termed the generalization gap. The
procedure is repeated in a cross-validation fashion by cycling through multiple
speech, noise, and BRIR databases to accurately estimate the generalization
gap. The proposed framework is applied to evaluate the generalization potential
of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find
that for all models, the performance degrades the most in speech mismatches,
while good noise and room generalization can be achieved by training on
multiple databases. Moreover, while recent models show higher performance in
matched conditions, their performance substantially decreases in mismatched
conditions and can become inferior to that of the FFNN-based system.
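The gap measure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's exact definition: the metric, the relative-difference form, and the averaging over conditions are assumptions, and the scores below are hypothetical.

```python
def generalization_gap(score_mismatched: float, score_reference: float) -> float:
    """Relative performance drop of a model evaluated on an unseen condition,
    measured against a reference model trained directly on that condition.
    0.0 means no degradation; 0.25 means a 25% drop relative to the reference."""
    return (score_reference - score_mismatched) / score_reference

# Hypothetical enhancement scores for one test condition:
# the reference model was trained on the test databases themselves,
# the evaluated model on different speech/noise/BRIR databases.
ref_score = 1.20   # reference model, matched training (proxy for task difficulty)
eval_score = 0.90  # evaluated model, mismatched training

gap = generalization_gap(eval_score, ref_score)
print(f"generalization gap: {gap:.2f}")  # 0.25

# In the paper's protocol this comparison is repeated in a cross-validation
# fashion, cycling through multiple speech, noise, and BRIR databases, and
# the resulting gaps are aggregated to estimate generalization performance.
```

Because the reference model is retrained for each test condition, a hard condition lowers both scores and largely cancels out of the ratio, which is what separates task difficulty from the cost of unseen data.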
Related papers
- On the Condition Monitoring of Bolted Joints through Acoustic Emission and Deep Transfer Learning: Generalization, Ordinal Loss and Super-Convergence [0.12289361708127876]
This paper investigates the use of deep transfer learning based on convolutional neural networks (CNNs) to monitor bolted joints using acoustic emissions.
We evaluate the performance of our methodology using the ORION-AE benchmark, a structure composed of two thin beams connected by three bolts.
arXiv Detail & Related papers (2024-05-29T13:07:21Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge [19.810337081901178]
Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals.
This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain.
The UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain.
arXiv Detail & Related papers (2024-02-02T13:45:42Z)
- Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler [16.13996677489119]
Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully.
Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models.
We show that a proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.
arXiv Detail & Related papers (2023-12-05T11:40:38Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment [12.144133923535714]
This paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric.
It can predict room acoustics parameters alongside the overall mean opinion score (MOS) for speech quality.
We also show that this joint training method enhances the blind estimation of room acoustics.
arXiv Detail & Related papers (2022-04-04T09:38:15Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also improves robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.