Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge
- URL: http://arxiv.org/abs/2402.01413v2
- Date: Wed, 10 Jul 2024 11:16:08 GMT
- Title: Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge
- Authors: Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker
- Abstract summary: Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals.
This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain.
The UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain.
- Score: 19.810337081901178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.
Related papers
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that the proposed TRNet substantially improves the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments [0.7366405857677227]
Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or room impulse response database.
The present study introduces a generalization assessment framework that uses a reference model trained on the test condition.
The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), ConvTasNet, DCCRN and MANNER.
arXiv Detail & Related papers (2023-09-12T12:51:12Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.
It is high time that we enhance the robustness of speech processing models to obtain good performance when encountering speech distortions.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results [27.074806625047646]
DNS Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement.
We open-sourced a large clean speech and noise corpus for training the noise suppression models.
We also open-sourced an online subjective test framework based on ITU-T P.808 for researchers to reliably test their developments.
arXiv Detail & Related papers (2020-05-16T23:48:37Z)
- The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework [27.074806625047646]
The INTERSPEECH 2020 Deep Noise Suppression Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement.
We open-source a large clean speech and noise corpus for training the noise suppression models, along with a test set representative of real-world scenarios.
The winners of this challenge will be selected based on subjective evaluation on a representative test set using the P.808 framework.
arXiv Detail & Related papers (2020-01-23T17:00:21Z)