Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data
- URL: http://arxiv.org/abs/2203.15321v1
- Date: Tue, 29 Mar 2022 08:06:01 GMT
- Title: Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data
- Authors: Chen Chen, Nana Hou, Yuchen Hu, Shashank Shirol, Eng Siong Chng
- Abstract summary: We propose a generative adversarial network to simulate the noisy spectrum from the clean spectrum (Simu-GAN). We also propose a dual-path speech recognition system to improve the robustness of the system under noisy conditions.
- Score: 24.512424190830828
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Noise-robust speech recognition systems require large amounts of training data, including noisy speech data and corresponding transcripts, to achieve state-of-the-art performance in the face of various practical environments. However, such an abundance of in-domain data is not always available in the real world. In this paper, we propose a generative adversarial network to simulate the noisy spectrum from the clean spectrum (Simu-GAN), where only 10 minutes of unparalleled in-domain noisy speech data is required as labels. Furthermore, we also propose a dual-path speech recognition system to improve the robustness of the system under noisy conditions. Experimental results show that the proposed speech recognition system achieves a 7.3% absolute improvement with noisy data simulated by Simu-GAN over the best baseline in terms of word error rate (WER).
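The abstract gives no implementation details, so the following is only a minimal sketch of the adversarial setup it describes: a generator maps clean magnitude spectrograms to simulated noisy ones, and a discriminator is trained against the small pool of real in-domain noisy spectrograms (about 10 minutes in the paper). The layer sizes, the residual generator, the per-utterance discriminator score, and the plain GAN loss are all illustrative assumptions of this sketch, not the authors' architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Clean spectrogram -> simulated noisy spectrogram (residual mapping)."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, clean):               # clean: (batch, n_mels, frames)
        return clean + self.net(clean)      # learn only the corruption

class Discriminator(nn.Module):
    """Scores whether a spectrogram resembles real in-domain noisy speech."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 1, kernel_size=3),
        )

    def forward(self, spec):
        return self.net(spec).mean(dim=(1, 2))   # one logit per utterance

def train_step(G, D, opt_g, opt_d, clean, real_noisy):
    """One adversarial update; clean and real_noisy are unpaired batches."""
    bce = nn.BCEWithLogitsLoss()
    # Discriminator: real in-domain noisy vs. simulated noisy spectra.
    fake = G(clean).detach()
    d_loss = (bce(D(real_noisy), torch.ones(real_noisy.size(0)))
              + bce(D(fake), torch.zeros(fake.size(0))))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: make simulated spectra indistinguishable from real noisy ones.
    g_loss = bce(D(G(clean)), torch.ones(clean.size(0)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Under this reading, each simulated pair (clean utterance, generated noisy spectrum) inherits the clean utterance's transcript, which is what lets the recognizer train on "noisy" data without any paired in-domain recordings.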
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial for improving the quality of speech separated from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate processing artifacts (a minimal sketch of this remixing idea appears after this list).
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN [41.88097793717185]
We propose a novel method named Multi-discriminators CycleGAN to reduce the noise of input speech.
We show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data.
arXiv Detail & Related papers (2021-12-12T19:56:34Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach with only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in the presence of noise.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
- Incorporating Broad Phonetic Information for Speech Enhancement [23.12902068334228]
In noisy conditions, knowing the speech content helps listeners suppress background noise more effectively.
Previous studies have confirmed the benefits of incorporating phonetic information in a speech enhancement (SE) system.
This study proposes to incorporate broad phonetic class (BPC) information into the SE process.
arXiv Detail & Related papers (2020-08-13T09:38:08Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
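As a side note on the mixture-invariant-training entry above, the remixing trick it mentions is simple to illustrate: blend the enhanced and raw waveforms so that a little residual noise masks enhancement artifacts. In this sketch the `enhance` callable and the 0.7 blend weight are placeholders of mine, not values from that paper.

```python
import numpy as np

def remix(noisy: np.ndarray, enhance, alpha: float = 0.7) -> np.ndarray:
    """Blend processed and unprocessed signals to soften enhancement artifacts."""
    enhanced = enhance(noisy)               # enhancement front-end, same length
    return alpha * enhanced + (1.0 - alpha) * noisy

# Example call with a trivial stand-in enhancer (identity):
x = np.random.randn(16000).astype(np.float32)   # 1 s of audio at 16 kHz
y = remix(x, enhance=lambda s: s)
assert y.shape == x.shape
```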