Bring the Noise: Introducing Noise Robustness to Pretrained Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2309.02145v1
- Date: Tue, 5 Sep 2023 11:34:21 GMT
- Title: Bring the Noise: Introducing Noise Robustness to Pretrained Automatic
Speech Recognition
- Authors: Patrick Eickhoff, Matthias Möller, Theresa Pekarek Rosin, Johannes
Twiefel, Stefan Wermter
- Abstract summary: We propose a novel method to extract denoising capabilities that can be applied to any encoder-decoder architecture.
We train our pre-processor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs.
We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions.
- Score: 13.53738829631595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent research, in the domain of speech processing, large End-to-End
(E2E) systems for Automatic Speech Recognition (ASR) have reported
state-of-the-art performance on various benchmarks. These systems intrinsically
learn how to handle and remove noise conditions from speech. Previous research
has shown that it is possible to extract the denoising capabilities of these
models into a preprocessor network, which can be used as a frontend for
downstream ASR models. However, the proposed methods were limited to specific
fully convolutional architectures. In this work, we propose a novel method to
extract the denoising capabilities that can be applied to any encoder-decoder
architecture. We propose the Cleancoder preprocessor architecture that extracts
hidden activations from the Conformer ASR model and feeds them to a decoder to
predict denoised spectrograms. We train our preprocessor on the Noisy Speech
Database (NSD) to reconstruct denoised spectrograms from noisy inputs. Then, we
evaluate our model as a frontend to a pretrained Conformer ASR model as well as
a frontend to train smaller Conformer ASR models from scratch. We show that the
Cleancoder is able to filter noise from speech and that it improves the total
Word Error Rate (WER) of the downstream model in noisy conditions for both
applications.
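The frontend pattern the abstract describes can be sketched as follows. This is a minimal, untrained illustration only: the paper extracts hidden activations from a pretrained Conformer encoder, whereas here a random-weight linear encoder stands in for those activations, and all shapes, names, and layer sizes are assumptions rather than the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_cleancoder(n_mels: int = 80, hidden: int = 128) -> dict:
    # Random, untrained weights purely for shape illustration.
    return {
        "enc": rng.standard_normal((n_mels, hidden)) * 0.01,  # stand-in for Conformer activations
        "dec": rng.standard_normal((hidden, n_mels)) * 0.01,  # decoder predicting the spectrogram
    }

def cleancoder(params: dict, noisy_spec: np.ndarray) -> np.ndarray:
    # noisy_spec: (time, n_mels) log-mel frames -> denoised (time, n_mels)
    hidden = np.maximum(noisy_spec @ params["enc"], 0.0)  # encoder activations (ReLU)
    return hidden @ params["dec"]                         # reconstruct a denoised spectrogram

# Frontend usage: denoise first, then hand the cleaned features to any
# downstream ASR model (pretrained or trained from scratch).
params = init_cleancoder()
noisy = rng.standard_normal((200, 80))   # 200 frames of noisy log-mel features
denoised = cleancoder(params, noisy)     # same shape as the input
assert denoised.shape == noisy.shape
```

The key design point carried over from the abstract is that the preprocessor's output lives in the same spectrogram space as its input, so it can be slotted in front of an unmodified downstream ASR model.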
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR [35.710735895190844]
We propose a self-supervised framework named Wav2code to implement a feature-level SE with reduced distortions for noise-robust ASR.
During finetuning, we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations.
Experiments on both synthetic and real noisy datasets demonstrate that Wav2code can solve the speech distortion and improve ASR performance under various noisy conditions.
arXiv Detail & Related papers (2023-04-11T04:46:12Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder [30.318947721658862]
We propose to include noise information in the training phase by using a noise-aware encoder trained on noisy-clean speech pairs.
We show that our proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters.
arXiv Detail & Related papers (2021-02-17T11:40:42Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Learning Noise-Aware Encoder-Decoder from Noisy Labels by Alternating Back-Propagation for Saliency Detection [54.98042023365694]
We propose a noise-aware encoder-decoder framework to disentangle a clean saliency predictor from noisy training examples.
The proposed model consists of two sub-models parameterized by neural networks.
arXiv Detail & Related papers (2020-07-23T18:47:36Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.