Speech-enhanced and Noise-aware Networks for Robust Speech Recognition
- URL: http://arxiv.org/abs/2203.13696v1
- Date: Fri, 25 Mar 2022 15:04:51 GMT
- Title: Speech-enhanced and Noise-aware Networks for Robust Speech Recognition
- Authors: Hung-Shin Lee, Pin-Yuan Chen, Yu Tsao, Hsin-Min Wang
- Abstract summary: A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
- Score: 25.279902171523233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compensation for channel mismatch and noise interference is essential for
robust automatic speech recognition. Enhanced speech has been introduced into
the multi-condition training of acoustic models to improve their generalization
ability. In this paper, a noise-aware training framework based on two cascaded
neural structures is proposed to jointly optimize speech enhancement and speech
recognition. The feature enhancement module is composed of a multi-task
autoencoder, where noisy speech is decomposed into clean speech and noise. By
concatenating its enhanced, noise-aware, and noisy features for each frame, the
acoustic-modeling module maps each feature-augmented frame into a triphone
state by optimizing the lattice-free maximum mutual information and cross
entropy between the predicted and actual state sequences. On top of the
factorized time delay neural network (TDNN-F) and its convolutional variant
(CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error
rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared
with the best existing systems that use bigram and trigram language models for
decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction
of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based
system also outperforms the baseline CNN-TDNNF system on the AMI task.
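
Below is a minimal PyTorch sketch of the cascaded structure the abstract describes: a multi-task autoencoder decomposes each noisy frame into clean-speech and noise estimates, and the acoustic model consumes the frame-wise concatenation of enhanced, noise-aware, and noisy features. All layer sizes and names are illustrative, and plain cross entropy stands in for the paper's LF-MMI objective, which requires lattice supervision (e.g. Kaldi/PyChain) not reproduced here.

```python
# Minimal sketch under assumed sizes; not the authors' exact architecture.
import torch
import torch.nn as nn

FEAT_DIM, HID, NUM_STATES = 40, 512, 2000       # hypothetical dimensions

class MultiTaskAutoencoder(nn.Module):
    """Shared encoder, two decoders: clean-speech and noise estimates."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(FEAT_DIM, HID), nn.ReLU(),
                                     nn.Linear(HID, HID), nn.ReLU())
        self.clean_head = nn.Linear(HID, FEAT_DIM)   # enhanced features
        self.noise_head = nn.Linear(HID, FEAT_DIM)   # noise-aware features

    def forward(self, noisy):                        # (B, T, FEAT_DIM)
        h = self.encoder(noisy)
        return self.clean_head(h), self.noise_head(h)

class AcousticModel(nn.Module):
    """Maps each feature-augmented frame to triphone-state logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * FEAT_DIM, HID), nn.ReLU(),
                                 nn.Linear(HID, NUM_STATES))

    def forward(self, enhanced, noise, noisy):       # frame-wise concatenation
        return self.net(torch.cat([enhanced, noise, noisy], dim=-1))

mtae, am = MultiTaskAutoencoder(), AcousticModel()
noisy = torch.randn(4, 100, FEAT_DIM)                # noisy feature frames
clean_ref, noise_ref = torch.randn_like(noisy), torch.randn_like(noisy)
states = torch.randint(0, NUM_STATES, (4, 100))      # frame-level state targets

enh, nse = mtae(noisy)
logits = am(enh, nse, noisy)
loss = (nn.functional.mse_loss(enh, clean_ref)       # enhancement terms
        + nn.functional.mse_loss(nse, noise_ref)
        + nn.functional.cross_entropy(logits.reshape(-1, NUM_STATES),
                                      states.reshape(-1)))
loss.backward()                                      # joint optimization step
```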
Related papers
- LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement [4.891339883978289]
We propose the long short-term memory speech enhancement network (LSTMSE-Net).
This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals.
The system scales and highlights visual and audio features, then passes them through a separator network for optimized speech enhancement.
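
A hedged sketch of the fusion pattern this summary describes, assuming projected audio and visual streams, a BiLSTM separator, and a sigmoid mask over the noisy audio features; the dimensions and module layout are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

A_DIM, V_DIM, HID = 257, 512, 256   # assumed audio/visual feature sizes

class AVSeparator(nn.Module):
    def __init__(self):
        super().__init__()
        self.a_proj = nn.Linear(A_DIM, HID)          # scale audio features
        self.v_proj = nn.Linear(V_DIM, HID)          # scale visual features
        self.lstm = nn.LSTM(2 * HID, HID, batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * HID, A_DIM), nn.Sigmoid())

    def forward(self, audio, visual):       # (B, T, A_DIM), (B, T, V_DIM)
        fused = torch.cat([self.a_proj(audio), self.v_proj(visual)], dim=-1)
        h, _ = self.lstm(fused)                      # separator network
        return self.mask(h) * audio                  # masked (enhanced) audio

enhanced = AVSeparator()(torch.randn(2, 50, A_DIM), torch.randn(2, 50, V_DIM))
```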
arXiv Detail & Related papers (2024-09-03T19:52:49Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the system to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification ( CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and online setup.
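
The hybrid CTC/attention recipe interpolates two losses computed over a shared encoder. A toy sketch under assumed sizes, with a stand-in linear head in place of a real attention decoder and an assumed interpolation weight of 0.3:

```python
import torch
import torch.nn as nn

VOCAB, D = 500, 256
encoder = nn.LSTM(80, D, batch_first=True)     # stand-in for the AV encoder
ctc_head = nn.Linear(D, VOCAB + 1)             # +1 for the CTC blank symbol
dec_head = nn.Linear(D, VOCAB)                 # toy stand-in attention decoder

x = torch.randn(2, 120, 80)                    # (batch, frames, features)
ys = torch.randint(1, VOCAB, (2, 20))          # label sequences
h, _ = encoder(x)

log_probs = ctc_head(h).log_softmax(-1).transpose(0, 1)   # (T, B, VOCAB+1)
ctc_loss = nn.functional.ctc_loss(
    log_probs, ys,
    input_lengths=torch.full((2,), 120),
    target_lengths=torch.full((2,), 20))

# Toy decoder loss: scores the first 20 encoder frames against the labels;
# a real system would use an autoregressive attention decoder here.
att_logits = dec_head(h[:, :20])
att_loss = nn.functional.cross_entropy(att_logits.reshape(-1, VOCAB),
                                       ys.reshape(-1))
loss = 0.3 * ctc_loss + 0.7 * att_loss         # hybrid interpolation
```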
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement in the time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
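
A small sketch of this decoupled-then-recombined idea, assuming simple linear decoders in place of the paper's conformer blocks: one path masks the magnitude, the other predicts a complex residual, and the two are summed before the inverse STFT. Shapes and STFT settings are illustrative:

```python
import torch
import torch.nn as nn

F_BINS = 201                                      # e.g. 400-point STFT
mag_decoder = nn.Sequential(nn.Linear(F_BINS, F_BINS), nn.Sigmoid())
cplx_decoder = nn.Linear(F_BINS, 2 * F_BINS)      # real + imaginary residual

noisy = torch.randn(2, 100, F_BINS, dtype=torch.cfloat)   # (B, T, F) spectrum
mag, phase = noisy.abs(), noisy.angle()

masked_mag = mag_decoder(mag) * mag               # magnitude estimation path
res_real, res_imag = cplx_decoder(mag).chunk(2, dim=-1)   # complex refinement

# Jointly incorporate both paths: masked magnitude on the noisy phase,
# plus a complex residual that can also correct the phase.
enhanced = (masked_mag * torch.exp(1j * phase)
            + torch.complex(res_real, res_imag))
wave = torch.istft(enhanced.transpose(1, 2), n_fft=400, hop_length=100)
```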
arXiv Detail & Related papers (2022-03-28T23:53:34Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
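
A compact sketch of lottery-ticket-style pruning with iterative fine-tuning, the pattern LTH-IF names: train, prune the smallest surviving weights, rewind the remainder to their initial values, and repeat. The 20% rate, single-layer model, and training stub are assumptions:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(64, 32)
init_state = copy.deepcopy(model.state_dict())     # weights to rewind to
mask = torch.ones_like(model.weight)

def train_briefly(m):                              # placeholder for real training
    pass

for round_ in range(5):                            # iterative prune/fine-tune
    train_briefly(model)
    scores = (model.weight * mask).abs()
    k = int(0.2 * mask.sum())                      # prune 20% of survivors
    thresh = scores[mask.bool()].kthvalue(k).values
    mask = (scores > thresh).float() * mask
    model.load_state_dict(init_state)              # lottery-ticket rewind
    with torch.no_grad():
        model.weight.mul_(mask)                    # apply the sparsity mask
```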
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
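
A minimal sketch contrasting temporal average pooling with attention-weighted pooling over frame embeddings; the CNN front end is omitted (the random frames below stand in for its output), and all sizes are assumed:

```python
import torch
import torch.nn as nn

D, MAX_SPK = 128, 10
frames = torch.randn(2, 200, D)                 # (batch, frames, CNN features)

avg = frames.mean(dim=1)                        # conventional average pooling

attn = nn.Linear(D, 1)                          # attention-guided pooling
w = torch.softmax(attn(frames), dim=1)          # (B, T, 1) frame weights
pooled = (w * frames).sum(dim=1)                # weighted sum over time

count_logits = nn.Linear(D, MAX_SPK + 1)(pooled)  # 0..MAX_SPK active speakers
```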
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network [32.59704287230343]
The proposed approach classifies audio into three classes: non-speech, single-speaker speech, and overlapped speech.
A convolutional recurrent neural network architecture is explored to benefit from both the convolutional layers' capability to model local patterns and the recurrent layers' ability to model sequential information.
The proposed overlapped speech detection model establishes state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set.
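
A hedged sketch of such a convolutional recurrent classifier, emitting per-frame logits over the three classes; kernel sizes, layer counts, and feature dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, hid=128, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(                  # local spectro-temporal patterns
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                   # pool frequency, keep time
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)))
        self.gru = nn.GRU(32 * (n_mels // 4), hid,  # sequential context
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, n_classes)

    def forward(self, x):                           # (B, T, n_mels) log-mel input
        z = self.conv(x.unsqueeze(1))               # (B, C, T, n_mels // 4)
        z = z.permute(0, 2, 1, 3).flatten(2)        # (B, T, C * n_mels // 4)
        h, _ = self.gru(z)
        return self.out(h)                          # per-frame class logits

logits = CRNN()(torch.randn(2, 100, 64))            # (2, 100, 3)
```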
arXiv Detail & Related papers (2021-04-07T03:01:34Z)
- WaDeNet: Wavelet Decomposition based CNN for Speech Processing [0.0]
WaDeNet is an end-to-end model for mobile speech processing.
WaDeNet embeds wavelet decomposition of the speech signal within the architecture.
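
One way to embed wavelet decomposition inside a network, sketched here with a one-level Haar DWT realized as a fixed stride-2 convolution feeding a small CNN; the Haar filters are standard, everything else is an assumption rather than WaDeNet's actual design:

```python
import torch
import torch.nn as nn

class HaarDWT(nn.Module):
    def __init__(self):
        super().__init__()
        h = 2 ** -0.5
        filters = torch.tensor([[[h, h]], [[h, -h]]])   # low-pass, high-pass
        self.conv = nn.Conv1d(1, 2, kernel_size=2, stride=2, bias=False)
        self.conv.weight.data = filters
        self.conv.weight.requires_grad = False          # fixed wavelet filters

    def forward(self, wave):                            # (B, 1, samples)
        return self.conv(wave)                          # (B, 2, samples // 2)

net = nn.Sequential(HaarDWT(),
                    nn.Conv1d(2, 16, 9, stride=4), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                    nn.Linear(16, 10))                  # e.g. 10 target classes

logits = net(torch.randn(4, 1, 16000))                  # one second at 16 kHz
```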
arXiv Detail & Related papers (2020-11-11T06:43:03Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyperparameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
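
A toy sketch of the DARTS-style selection mentioned above: candidate operations (here, 1-D convolutions of different context widths, loosely mimicking TDNN contexts) are blended with softmax-weighted architecture parameters learned jointly with the model weights; after search, the highest-weighted operation is kept. The candidate set and sizes are illustrative:

```python
import torch
import torch.nn as nn

D = 64
candidates = nn.ModuleList([
    nn.Conv1d(D, D, k, padding=k // 2) for k in (1, 3, 5)  # context widths
])
alpha = nn.Parameter(torch.zeros(len(candidates)))          # architecture params

def mixed_op(x):                                            # (B, D, T)
    w = torch.softmax(alpha, dim=0)                         # op weights
    return sum(wi * op(x) for wi, op in zip(w, candidates))

y = mixed_op(torch.randn(2, D, 100))     # after search: keep argmax(alpha) op
```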
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient end-to-end speech enhancement (E2E SE) model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRUs).
In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
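
A hedged sketch of this overall shape, with a GRU standing in for the SRU stack and a learned sigmoid mask applied to the hidden feature maps in the spirit of RFM (rather than to a spectrogram); all sizes are assumptions:

```python
import torch
import torch.nn as nn

class WaveCRNLike(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc = nn.Conv1d(1, ch, 31, stride=16, padding=15)   # locality
        self.rnn = nn.GRU(ch, ch, batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * ch, ch), nn.Sigmoid())
        self.dec = nn.ConvTranspose1d(ch, 1, 31, stride=16, padding=15,
                                      output_padding=15)

    def forward(self, wave):                       # (B, 1, samples)
        f = self.enc(wave)                         # (B, ch, T) feature maps
        h, _ = self.rnn(f.transpose(1, 2))         # temporal modeling
        m = self.mask(h).transpose(1, 2)           # mask over hidden features
        return self.dec(f * m)                     # enhanced waveform

out = WaveCRNLike()(torch.randn(2, 1, 16000))      # same length in and out
```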
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.