Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction
- URL: http://arxiv.org/abs/2110.15430v1
- Date: Thu, 28 Oct 2021 20:39:02 GMT
- Authors: Heming Wang, Yao Qian, Xiaofei Wang, Yiming Wang, Chengyi Wang, Shujie
Liu, Takuya Yoshioka, Jinyu Li and DeLiang Wang
- Abstract summary: Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach using only 16% of the labeled data.
- Score: 109.44933866397123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Noise robustness is essential for deploying automatic speech recognition
(ASR) systems in real-world environments. One way to reduce the effect of noise
interference is to employ a preprocessing module that conducts speech
enhancement, and then feed the enhanced speech to an ASR backend. In this work,
instead of suppressing background noise with a conventional cascaded pipeline,
we employ a noise-robust representation learned by a refined self-supervised
framework for noisy speech recognition. We propose to combine a reconstruction
module with contrastive learning and perform multi-task continual pre-training
on noisy data. The reconstruction module is used for auxiliary learning to
improve the noise robustness of the learned representation and thus is not
required during inference. Experiments demonstrate the effectiveness of our
proposed method. Our model substantially reduces the word error rate (WER) for
the synthesized noisy LibriSpeech test sets, and yields around 4.1/7.5% WER
reduction on noisy clean/other test sets compared to data augmentation. For the
real-world noisy speech from the CHiME-4 challenge (1-channel track), we
obtain state-of-the-art ASR performance without any denoising front-end.
Moreover, we achieve performance comparable to the best reported supervised
approach using only 16% of the labeled data.
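The abstract describes the objective only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: a wav2vec 2.0-style contrastive loss combined with an auxiliary clean-speech reconstruction loss. The module interfaces, the L1 reconstruction target, and the 0.5 loss weight are assumptions for illustration, not details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class NoisyPretrainObjective(nn.Module):
    """Hypothetical multi-task objective: contrastive + auxiliary reconstruction."""

    def __init__(self, encoder: nn.Module, recon_head: nn.Module,
                 recon_weight: float = 0.5):
        super().__init__()
        self.encoder = encoder        # wav2vec 2.0-style contextual encoder
        self.recon_head = recon_head  # auxiliary decoder; dropped at inference
        self.recon_weight = recon_weight

    def forward(self, noisy_wave, clean_target):
        # The encoder is assumed to return frame-level context vectors together
        # with its own wav2vec 2.0-style contrastive loss over quantized targets.
        context, contrastive_loss = self.encoder(noisy_wave)
        # Auxiliary task: reconstruct the clean speech (e.g. its spectrogram)
        # from representations of the noisy input, encouraging noise robustness.
        recon = self.recon_head(context)
        recon_loss = F.l1_loss(recon, clean_target)
        # The reconstruction head only serves this auxiliary loss, so it can be
        # discarded once pre-training is finished.
        return contrastive_loss + self.recon_weight * recon_loss
```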
Related papers
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Experimental results validate that TRNet substantially improves the robustness of the recognition system in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- On the Effectiveness of ASR Representations in Real-world Noisy Speech
Emotion Recognition [26.013815255299342]
We propose an efficient approach to noisy speech emotion recognition (NSER).
We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech.
Our experimental results show that the proposed method 1) achieves better NSER performance than the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches that use ASR transcriptions or the ground-truth transcription of the noisy speech.
arXiv Detail & Related papers (2023-11-13T05:45:55Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based
on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- Unifying Speech Enhancement and Separation with Gradient Modulation for
End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves state-of-the-art results on the large-scale Libri2Mix-noisy and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP alleviates common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement
[83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) uses a noise detector to separate clean and noisy frames in the intervened noisy speech, and assigns the two sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- On monoaural speech enhancement for automatic recognition of real noisy
speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial for improving the quality of the speech separated from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for
Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets (see the sketch after this list).
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation
Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
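As referenced in the Wav2vec-Switch entry above, the following is a minimal sketch of its target-switching idea, assuming a shared wav2vec 2.0-style model object; the helper methods encode_with_quantization and contrastive_loss are hypothetical stand-ins for the corresponding wav2vec 2.0 components, not the authors' actual API.

```python
def wav2vec_switch_loss(model, original_wave, noisy_wave):
    # Encode both views of the same utterance with the shared network. Each
    # call is assumed to return contextual features and quantized targets.
    ctx_orig, quant_orig = model.encode_with_quantization(original_wave)
    ctx_noisy, quant_noisy = model.encode_with_quantization(noisy_wave)

    # Standard contrastive task: each view predicts its own quantized targets.
    loss = model.contrastive_loss(ctx_orig, quant_orig)
    loss = loss + model.contrastive_loss(ctx_noisy, quant_noisy)

    # Switched task: swap the quantized targets across the pair, so the noisy
    # context must agree with the clean targets and vice versa, encouraging
    # noise-invariant representations.
    loss = loss + model.contrastive_loss(ctx_orig, quant_noisy)
    loss = loss + model.contrastive_loss(ctx_noisy, quant_orig)
    return loss
```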