A Training and Inference Strategy Using Noisy and Enhanced Speech as
Target for Speech Enhancement without Clean Speech
- URL: http://arxiv.org/abs/2210.15368v3
- Date: Mon, 22 May 2023 14:02:35 GMT
- Authors: Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
- Abstract summary: We propose a training and inference strategy that additionally uses enhanced speech as a target.
Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing the teacher model's estimated speech and noise.
Experimental results show that our proposed method outperforms several baselines.
- Score: 24.036987059698415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of clean speech is a practical challenge to the development of
speech enhancement systems, which means that there is an inevitable mismatch
between their training criterion and evaluation metric. In response to this
unfavorable situation, we propose a training and inference strategy that
additionally uses enhanced speech as a target by improving the previously
proposed noisy-target training (NyTT). Because homogeneity between in-domain
noise and extraneous noise is the key to the effectiveness of NyTT, we train
various student models by remixing 1) the teacher model's estimated speech and
noise for enhanced-target training or 2) raw noisy speech and the teacher
model's estimated noise for noisy-target training. Experimental results show
that our proposed method outperforms several baselines, especially with the
teacher/student inference, where predicted clean speech is derived successively
through the teacher and final student models.
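The two remixing options and the teacher/student inference described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: `teacher_enhance`, `student_enhance`, and the batch-level noise shuffling are hypothetical stand-ins for whatever enhancement models and remixing scheme are actually used.

```python
import numpy as np

def remix_targets(noisy, teacher_enhance, rng):
    """Build (input, target) training pairs without clean speech (sketch)."""
    # Teacher separates noisy speech into estimated speech and noise.
    est_speech = teacher_enhance(noisy)
    est_noise = noisy - est_speech
    # Shuffle the estimated noise across the batch so each utterance is
    # remixed with noise from the same (in-domain) distribution.
    shuffled_noise = est_noise[rng.permutation(len(est_noise))]
    # 1) Enhanced-target training: estimated speech + remixed estimated
    #    noise as input, estimated speech as target.
    et_pair = (est_speech + shuffled_noise, est_speech)
    # 2) Noisy-target training: raw noisy speech + remixed estimated
    #    noise as input, noisy speech as target.
    nt_pair = (noisy + shuffled_noise, noisy)
    return et_pair, nt_pair

def teacher_student_inference(noisy, teacher_enhance, student_enhance):
    # Teacher/student inference: predicted clean speech is derived
    # successively through the teacher and the final student model.
    return student_enhance(teacher_enhance(noisy))
```

Chaining the teacher and student at inference time is what the abstract reports as the strongest configuration.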
Related papers
- Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions [25.490988931354185]
We propose a novel two-stage framework that cascades a target speaker extraction (TSE) front-end with speech emotion recognition (SER).
We first train a TSE model to extract the target speaker's speech from a mixture. In the second stage, we use the extracted speech for SER training.
Our system achieves a 14.33% improvement in unweighted accuracy (UA) over a baseline that does not use TSE.
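The cascade this entry describes reduces to a simple pipeline; the sketch below is hypothetical, with `tse_model` and `ser_model` standing in for the paper's actual trained models.

```python
def emotion_from_mixture(mixture, enrollment, tse_model, ser_model):
    # Stage 1: extract the target speaker's speech from the mixture,
    # conditioned on an enrollment sample of that speaker.
    target_speech = tse_model(mixture, enrollment)
    # Stage 2: run emotion recognition on the extracted speech only,
    # so interfering human speech no longer corrupts the SER input.
    return ser_model(target_speech)
```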
arXiv Detail & Related papers (2024-09-29T07:04:50Z)
- Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE)
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
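The augmented objective described here is a weighted sum of two terms. The sketch below assumes a simple convex weighting with a hypothetical parameter `alpha`; the paper's exact weighting scheme may differ.

```python
import numpy as np

def weighted_se_loss(score_pred, score_target, enhanced, clean, alpha=0.5):
    # Generative term: standard denoising score-matching MSE.
    diffusion_term = np.mean((score_pred - score_target) ** 2)
    # Supervised term: MSE between estimated enhanced speech and
    # ground-truth clean speech.
    supervised_term = np.mean((enhanced - clean) ** 2)
    # Weighted combination of the two objectives.
    return alpha * diffusion_term + (1.0 - alpha) * supervised_term
```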
arXiv Detail & Related papers (2023-09-19T09:13:35Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
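One plausible reading of the state variable is a continuous interpolation between the noisy input and a target that deliberately keeps a small amount of residual noise. This is a speculative sketch; the interpolation form and the `eps` parameter are assumptions, not the paper's formulation.

```python
import numpy as np

def denoising_state(clean, noisy, t, eps=0.05):
    # Continuous state between noisy speech (t = 1) and an almost-clean
    # target (t = 0) that keeps a small residual-noise fraction eps,
    # reflecting the finding that a little noise in the target helps.
    noise = noisy - clean
    return clean + (eps + (1.0 - eps) * t) * noise
```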
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
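The routing idea in CISE can be sketched as a detector that splits frames into noisy and clean sets, each handled by its own enhancement module. The callables below are hypothetical placeholders for the paper's noise detector and mask-based EMs.

```python
import numpy as np

def cise_enhance(frames, noise_detector, em_noisy, em_clean):
    # Detector marks which frames contain the noise "intervention".
    is_noisy = noise_detector(frames)
    out = np.empty_like(frames)
    # Each set of frames goes to its own enhancement module (EM),
    # giving noise-conditional enhancement.
    out[is_noisy] = em_noisy(frames[is_noisy])
    out[~is_noisy] = em_clean(frames[~is_noisy])
    return out
```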
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of supervised speech processing models.
It is therefore important to improve the robustness of speech processing models so that they perform well when encountering distorted speech.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech [28.012465936987013]
We propose MetricGAN-U, which relaxes the constraints of conventional unsupervised learning.
In MetricGAN-U, only noisy speech is required to train the model by optimizing non-intrusive speech quality metrics.
The experimental results verified that MetricGAN-U outperforms baselines in both objective and subjective metrics.
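Optimizing a non-intrusive quality metric means the objective needs no clean reference, only a predictor that scores the enhanced output. The sketch below assumes a scalar quality predictor and a fixed maximum score; both are illustrative stand-ins for MetricGAN-U's actual learned metric network.

```python
import numpy as np

def metricgan_u_loss(enhanced, quality_predictor, target_score=1.0):
    # Unsupervised objective: push a non-intrusive quality predictor's
    # score on the enhanced speech toward its maximum value, with no
    # clean reference involved anywhere.
    score = quality_predictor(enhanced)
    return float(np.mean((score - target_score) ** 2))
```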
arXiv Detail & Related papers (2021-10-12T10:01:32Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
- Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method that adapts a compact denoising model to test-time conditions.
The goal of this test-time adaptation is to avoid relying on any clean speech target from the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model.
arXiv Detail & Related papers (2021-05-08T00:42:03Z)
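The distillation idea in that last entry, using the large teacher's denoised output in place of the missing clean target, can be sketched with a toy one-parameter student. The linear student, the plain MSE loss, and the gradient step are all simplifying assumptions for illustration.

```python
import numpy as np

def distill_step(student_params, noisy, teacher_enhance, lr=0.01):
    # With no clean utterance available at test time, the teacher's
    # denoised output serves as the training label for the compact student.
    target = teacher_enhance(noisy)
    # Toy student: a single learned gain g applied to the noisy signal.
    pred = student_params["g"] * noisy
    # Gradient of the MSE distillation loss w.r.t. g, then one SGD step.
    grad = 2.0 * np.mean((pred - target) * noisy)
    student_params["g"] -= lr * grad
    return np.mean((pred - target) ** 2)
```

Repeated steps pull the student toward reproducing the teacher's enhancement on the test speaker's data, which is the essence of the zero-shot personalization described above.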
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.