Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2011.04249v1
- Date: Mon, 9 Nov 2020 08:52:05 GMT
- Title: Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition
- Authors: Cunhang Fan, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Bin Liu, Zhengqi
Wen
- Abstract summary: This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
- Score: 64.9317368575585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint training frameworks for speech enhancement and recognition have achieved good performance for robust end-to-end automatic speech recognition (ASR). However, these methods use only the enhanced features as the input to the speech recognition component, so they are affected by the speech distortion problem. To address this problem, this paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm dynamically combines the noisy and enhanced features; it can therefore not only remove noise signals from the enhanced features but also learn the raw fine structure from the noisy features, which alleviates the speech distortion. The proposed method consists of speech enhancement, GRF, and speech recognition. First, a mask-based speech enhancement network is applied to enhance the input speech. Second, the GRF is applied to address the speech distortion problem. Third, to improve ASR performance, the state-of-the-art speech transformer is used as the speech recognition component. Finally, the joint training framework optimizes these three components simultaneously. Experiments are conducted on AISHELL-1, an open-source Mandarin speech corpus. The results show that the proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method that uses only the enhanced features. In particular, at a low signal-to-noise ratio (0 dB), the proposed method achieves a 12.67% relative CER reduction, which suggests its potential.
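To make the pipeline above concrete, here is a minimal sketch, in plain PyTorch, of how a gated fusion module could combine the noisy and enhanced feature streams before the ASR encoder, together with the kind of joint loss the abstract describes. This is an illustrative reading of the abstract, not the authors' released implementation; the module name GatedFusion, the exact gate layout, and the weighting factor lambda_enh are assumptions.

# Illustrative sketch only (not the authors' code): gated fusion of noisy and
# enhanced features of shape (batch, time, feat_dim). Names, gate layout and
# the loss weighting are assumptions made for this example.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Per-frame, per-dimension gates deciding how much to trust each stream.
        self.gate_noisy = nn.Linear(2 * feat_dim, feat_dim)
        self.gate_enhanced = nn.Linear(2 * feat_dim, feat_dim)
        self.proj = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, noisy: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
        both = torch.cat([noisy, enhanced], dim=-1)
        gate_n = torch.sigmoid(self.gate_noisy(both))      # keeps raw fine structure
        gate_e = torch.sigmoid(self.gate_enhanced(both))   # keeps denoised content
        fused = torch.cat([gate_n * noisy, gate_e * enhanced], dim=-1)
        return torch.tanh(self.proj(fused))

# Joint training would then optimize enhancement, fusion and ASR together, e.g.
#   enhanced = enhancer(noisy)                     # mask-based enhancement net
#   fused = fusion(noisy, enhanced)                # GatedFusion above
#   loss = asr_loss(asr(fused), labels) + lambda_enh * mse(enhanced, clean)
# where enhancer, asr, asr_loss and lambda_enh are hypothetical placeholders.

The gating lets the recognizer fall back to the noisy stream wherever enhancement has over-suppressed speech, which is the intuition the abstract gives for alleviating distortion.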
Related papers
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated on mask-based MVDR speech separation and on DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition [26.77806246793544]
Speech enhancement (SE) is introduced as a front-end to reduce noise for ASR, but it also suppresses some important speech information.
We propose a dual-path style learning approach for end-to-end noise-robust speech recognition (DPSL-ASR)
Experiments show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline.
arXiv Detail & Related papers (2022-03-28T15:21:57Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition [25.84784710031567]
We propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition.
Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline.
Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
arXiv Detail & Related papers (2021-10-11T13:40:07Z)
- An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
The Bidirectional Encoder Representations from Transformers (BERT) model has achieved record-breaking success on many natural language processing tasks.
We explore the incorporation of confidence scores into sentence representations to see if such an attempt could help alleviate the negative effects caused by imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z)
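As a rough illustration of the confidence-score idea in the last entry (a sketch under assumptions, not the paper's method), one simple scheme weights each token embedding by its ASR confidence before pooling into a sentence representation; the function name and the weighted-mean pooling below are hypothetical.

# Hypothetical sketch: down-weight low-confidence ASR tokens when building a
# sentence vector, so recognition errors hurt the summary representation less.
import torch

def confidence_weighted_sentence_vector(token_embeddings: torch.Tensor,
                                        confidences: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (num_tokens, hidden_dim); confidences: (num_tokens,) in [0, 1]
    weights = confidences / confidences.sum().clamp(min=1e-8)  # normalize to sum to 1
    return (weights.unsqueeze(-1) * token_embeddings).sum(dim=0)  # weighted mean pooling

# Example: the second and last tokens were recognized with low confidence and
# therefore contribute less to the pooled sentence representation.
embeddings = torch.randn(5, 768)
confidences = torch.tensor([0.9, 0.2, 0.95, 0.8, 0.4])
sentence_vector = confidence_weighted_sentence_vector(embeddings, confidences)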