Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition
- URL: http://arxiv.org/abs/2203.14838v3
- Date: Sat, 27 May 2023 11:24:51 GMT
- Title: Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition
- Authors: Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng
- Abstract summary: Speech enhancement (SE) is introduced as a front-end to reduce noise for ASR, but it also suppresses some important speech information.
We propose a dual-path style learning approach for end-to-end noise-robust speech recognition (DPSL-ASR).
Experiments show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline on the RATS and CHiME-4 datasets, respectively.
- Score: 26.77806246793544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) systems degrade significantly under noisy
conditions. Recently, speech enhancement (SE) has been introduced as a front-end to
reduce noise for ASR, but it also suppresses some important speech information, a
problem known as over-suppression. To alleviate this, we propose a dual-path style
learning approach for end-to-end noise-robust speech recognition (DPSL-ASR).
Specifically, we first introduce the clean speech feature, along with the fused
feature from IFF-Net, as dual-path inputs to recover the suppressed information.
Then, we propose style learning to map the fused feature close to the clean
feature, in order to learn latent speech information from the latter, i.e., the
clean "speech style". Furthermore, we also minimize the distance between the final
ASR outputs of the two paths to improve noise robustness. Experiments show that the
proposed approach achieves relative word error rate (WER) reductions of 10.6%
and 8.6% over the best IFF-Net baseline, on the RATS and CHiME-4 datasets
respectively.
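To make the three-part objective concrete, here is a minimal PyTorch sketch of the training loss implied by the abstract. The L1 style distance, the KL-divergence consistency term, the loss weights, and the gradient-stopping on the clean path are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathStyleLoss(nn.Module):
    """Adds (1) a style loss pulling the fused feature toward the clean
    feature and (2) a consistency loss between the ASR outputs of the two
    paths, on top of the usual ASR training loss."""

    def __init__(self, style_weight: float = 1.0, consist_weight: float = 1.0):
        super().__init__()
        self.style_weight = style_weight
        self.consist_weight = consist_weight

    def forward(self, fused_feat, clean_feat, fused_logits, clean_logits, asr_loss):
        # Style learning: map the fused feature close to the clean feature
        # (L1 distance assumed here for illustration).
        style_loss = F.l1_loss(fused_feat, clean_feat.detach())

        # Dual-path consistency: minimize the distance between the final
        # ASR output distributions of the two paths (KL divergence assumed).
        consist_loss = F.kl_div(
            F.log_softmax(fused_logits, dim=-1),
            F.softmax(clean_logits, dim=-1).detach(),
            reduction="batchmean",
        )
        return asr_loss + self.style_weight * style_loss + self.consist_weight * consist_loss
```

Detaching the clean-path tensors treats the clean path as a teacher for the fused path; whether the paper stops gradients this way is likewise an assumption.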
Related papers
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The advantage of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech data constructed by simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR [35.710735895190844]
We propose a self-supervised framework named Wav2code to implement a feature-level SE with reduced distortions for noise-robust ASR.
During fine-tuning, we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependencies of the input noisy representations.
Experiments on both synthetic and real noisy datasets demonstrate that Wav2code reduces speech distortion and improves ASR performance under various noisy conditions.
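As a rough illustration of the lookup step, the sketch below shows nearest-neighbor quantization of frame-level features against a codebook, assuming PyTorch; the codebook size, feature dimension, and Euclidean distance rule are assumptions for illustration, not Wav2code's implementation.

```python
import torch

def codebook_lookup(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Replace each frame-level feature with its nearest codebook entry.

    features: (batch, frames, dim) noisy (or predicted) representations
    codebook: (num_codes, dim) discrete prior of clean speech
    """
    # Pairwise distances between every frame and every code: (batch, frames, codes)
    dists = torch.cdist(features, codebook.unsqueeze(0).expand(features.size(0), -1, -1))
    codes = dists.argmin(dim=-1)   # index of the nearest code per frame
    return codebook[codes]         # quantized, "restored" clean features

# Example: 8 utterances, 50 frames, 256-dim features, 1024 codes (all hypothetical)
restored = codebook_lookup(torch.randn(8, 50, 256), torch.randn(1024, 256))  # (8, 50, 256)
```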
arXiv Detail & Related papers (2023-04-11T04:46:12Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition [25.84784710031567]
We propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition.
Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline.
Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
arXiv Detail & Related papers (2021-10-11T13:40:07Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
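A minimal sketch of the switching idea is given below, assuming PyTorch and a simplified frame-level contrastive loss in which negatives come from the same utterance; the real wav2vec 2.0 objective samples negatives from masked positions and uses its own quantizer, so this is illustrative only.

```python
import torch
import torch.nn.functional as F

def switched_contrastive_loss(c_orig, c_noisy, q_orig, q_noisy, temperature=0.1):
    """c_*: contextual encoder outputs; q_*: quantized targets.
    All tensors are (batch, frames, dim). Each branch is trained against
    the *other* branch's quantized codes, encouraging noise-invariant
    representations."""
    def nce(context, target):
        # Cosine similarity of each frame to every target code in the
        # utterance; the time-aligned code is the positive.
        sim = F.cosine_similarity(context.unsqueeze(2), target.unsqueeze(1), dim=-1)
        logits = sim / temperature  # (batch, frames, frames)
        labels = torch.arange(context.size(1), device=context.device)
        return F.cross_entropy(logits.flatten(0, 1), labels.repeat(context.size(0)))

    # The switch: the original branch predicts the noisy codes and vice versa.
    return nce(c_orig, q_noisy.detach()) + nce(c_noisy, q_orig.detach())
```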
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
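As an illustration of such dynamic combination, here is a minimal PyTorch sketch using a common sigmoid-gate formulation; it is an assumed simplification, not the paper's exact GRF cell (which is recurrent).

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Dynamically combines noisy and enhanced features with a learned gate,
    letting the model fall back on the noisy input wherever enhancement
    has over-suppressed the speech."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, noisy: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
        # noisy, enhanced: (batch, frames, dim)
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        return g * enhanced + (1.0 - g) * noisy  # per-dimension soft selection

# Example: fuse 80-dim filterbank-like features (shapes hypothetical)
fusion = GatedFusion(dim=80)
out = fusion(torch.randn(4, 120, 80), torch.randn(4, 120, 80))  # (4, 120, 80)
```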
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed by simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)