Improving Speech Recognition on Noisy Speech via Speech Enhancement with
Multi-Discriminators CycleGAN
- URL: http://arxiv.org/abs/2112.06309v1
- Date: Sun, 12 Dec 2021 19:56:34 GMT
- Title: Improving Speech Recognition on Noisy Speech via Speech Enhancement with
Multi-Discriminators CycleGAN
- Authors: Chia-Yu Li and Ngoc Thang Vu
- Abstract summary: We propose a novel method named Multi-discriminators CycleGAN to reduce the noise of input speech.
We show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data.
- Score: 41.88097793717185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our latest investigations on improving automatic speech
recognition for noisy speech via speech enhancement. We propose a novel method
named Multi-discriminators CycleGAN to reduce the noise of input speech and
thereby improve automatic speech recognition performance. Our proposed
method leverages the CycleGAN framework for speech enhancement without any
parallel data and improves it by introducing multiple discriminators that check
different frequency areas. Furthermore, we show that training multiple
generators on homogeneous subsets of the training data is better than training
one generator on all the training data. We evaluate our method on the CHiME-3
data set and observe up to 10.03% relative WER improvement on the development
set and up to 14.09% on the evaluation set.
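The core idea of the abstract, several discriminators that each judge a different frequency area of the spectrogram, can be sketched in a few lines. The band split, the toy linear "discriminators", and the loss below are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def split_bands(spec, n_bands=3):
    """Split a (freq, time) spectrogram into n_bands frequency areas."""
    return np.array_split(spec, n_bands, axis=0)

class BandDiscriminator:
    """Toy per-band discriminator: a fixed random linear score squashed to (0, 1)."""
    def __init__(self, n_freq, seed):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=n_freq)

    def score(self, band):
        # Average the band over time, project, and apply a sigmoid.
        logit = self.w @ band.mean(axis=1)
        return 1.0 / (1.0 + np.exp(-logit))

def multi_discriminator_loss(spec, discriminators):
    """Sum of per-band non-saturating generator losses; each discriminator
    only ever sees its own frequency area of the enhanced spectrogram."""
    bands = split_bands(spec, n_bands=len(discriminators))
    return sum(-np.log(d.score(b) + 1e-8) for d, b in zip(discriminators, bands))

spec = np.abs(np.random.default_rng(0).normal(size=(80, 100)))  # fake 80-bin mel-spectrogram
bands = split_bands(spec, 3)
discs = [BandDiscriminator(b.shape[0], seed=i) for i, b in enumerate(bands)]
loss = multi_discriminator_loss(spec, discs)
```

In a real CycleGAN setup the generator would be trained to minimize this sum while each discriminator is trained on its band alone, so that low- and high-frequency artifacts are penalized separately.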
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial to improve quality of separated speech from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data [24.512424190830828]
We propose a generative adversarial network to simulate a noisy spectrum from the clean spectrum (Simu-GAN).
We also propose a dual-path speech recognition system to improve the robustness of the system under noisy conditions.
arXiv Detail & Related papers (2022-03-29T08:06:01Z)
- Curriculum optimization for low-resource speech recognition [4.803994937990389]
We propose an automated curriculum learning approach to optimize the sequence of training examples.
We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions.
arXiv Detail & Related papers (2022-02-17T19:47:50Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
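One related paper above scores raw audio difficulty by compression ratio for curriculum learning. A minimal sketch of that idea, assuming (as an illustration, not the paper's exact definition) zlib compression over 16-bit PCM bytes:

```python
import zlib
import numpy as np

def compression_ratio(audio, level=9):
    """Difficulty score: compressed size / raw size of the 16-bit PCM bytes.
    Less predictable (e.g. noisier) audio compresses worse, giving a higher ratio."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    return len(zlib.compress(pcm, level)) / len(pcm)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0                # one second at 16 kHz
tone = 0.5 * np.sin(2 * np.pi * 440 * t)      # highly predictable signal
noise = rng.normal(scale=0.5, size=t.size)    # unpredictable signal
easy, hard = compression_ratio(tone), compression_ratio(noise)
```

A pure tone compresses far better than white noise, so `easy < hard`; a curriculum could then present low-ratio (cleaner, more predictable) utterances before high-ratio ones.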
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.