Related papers: Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition

URL: http://arxiv.org/abs/2110.05267v1
Date: Mon, 11 Oct 2021 13:40:07 GMT
Title: Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition
Authors: Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng
Abstract summary: We propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition. Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline. Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
Score: 25.84784710031567
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech enhancement (SE) aims to suppress the additive noise in a noisy speech signal to improve the speech's perceptual quality and intelligibility. However, over-suppression in the enhanced speech might degrade the performance of the downstream automatic speech recognition (ASR) task because latent information is lost. To alleviate this problem, we propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition that learns complementary information from the enhanced feature and the original noisy feature. Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline on the RATS Channel-A corpus. Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
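
The abstract reports an absolute WER reduction, while some of the related papers listed on this page report relative reductions. As a minimal sketch of the difference, the Python below computes WER from a word-level edit distance and then both kinds of reduction. The function names and the example sentences are hypothetical illustrations and are not taken from the paper or its evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up sentences: one substitution and one deletion against six reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    # Made-up WER values, used only to contrast the two kinds of reduction.
    print(absolute_reduction(0.50, 0.40), relative_reduction(0.50, 0.40))
```

With these made-up values, the same improvement is an absolute reduction of about 0.10 (10 points) but a relative reduction of about 0.20 (20%), so the two kinds of figures are not directly comparable.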

Related papers

Active Speech Enhancement: Active Speech Denoising Decliping and Deveraberation [13.575063025878208]
We introduce a new paradigm for active sound modification: Active Speech Enhancement (ASE). We propose a novel Transformer-Mamba-based architecture, along with a task-specific loss function designed to jointly optimize interference suppression and signal enrichment. Our method outperforms existing baselines across multiple speech processing tasks, including denoising, dereverberation, and declipping.
arXiv Detail & Related papers (2025-05-22T17:10:18Z)
TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that the proposed TRNet substantially improves the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition [26.013815255299342]
We propose an efficient approach to noisy speech emotion recognition (NSER). We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. Our experimental results show that the proposed method 1) achieves better NSER performance than the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
arXiv Detail & Related papers (2023-11-13T05:45:55Z)
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper. The benefit of video input is consistently demonstrated in the mask-based MVDR speech separation front-end and in the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end. Experiments were conducted on overlapped and reverberant speech mixtures constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition [26.77806246793544]
Speech enhancement (SE) is introduced as a front-end to reduce noise for ASR, but it also suppresses some important speech information. We propose a dual-path style learning approach for end-to-end noise-robust speech recognition (DPSL-ASR). Experiments show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline.
arXiv Detail & Related papers (2022-03-28T15:21:57Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.