TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition
- URL: http://arxiv.org/abs/2404.12979v1
- Date: Fri, 19 Apr 2024 16:09:17 GMT
- Title: TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition
- Authors: Chengxin Chen, Pengyuan Zhang
- Abstract summary: Speech Emotion Recognition (SER) is subject to ubiquitous environmental noise.
We introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge.
We show that TRNet substantially increases the system's robustness in both matched and unmatched noisy environments.
- Score: 29.756961194844717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One persistent challenge in Speech Emotion Recognition (SER) is the ubiquitous environmental noise, which frequently results in diminished SER performance in practical use. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. Later, we utilize clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results validate that the proposed TRNet substantially increases the system's robustness in both matched and unmatched noisy environments, without compromising its performance in clean environments.
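The abstract only names the two refinement levels; a minimal sketch of such a training objective is given below, assuming an SER encoder, paired clean/enhanced spectrograms, and L1/MSE distances with weights `alpha` and `beta` (all of these are assumptions, not the authors' released code).

```python
import torch
import torch.nn.functional as F

def two_level_refinement_loss(enh_spec, clean_spec, encoder,
                              alpha=1.0, beta=1.0):
    """Hypothetical sketch of a TRNet-style two-level refinement term.

    enh_spec:   enhanced spectrogram from the SE front-end, shape (B, T, F)
    clean_spec: clean reference spectrogram, shape (B, T, F)
    encoder:    SER backbone producing deep representations
    """
    # Level 1: refine the spectrogram distortion of the enhanced speech
    # against the clean reference.
    low_level = F.l1_loss(enh_spec, clean_spec)

    # Level 2: refine the representation shift between the deep features
    # of enhanced and clean speech; the clean branch serves as a fixed
    # reference signal.
    enh_repr = encoder(enh_spec)
    with torch.no_grad():
        clean_repr = encoder(clean_spec)
    high_level = F.mse_loss(enh_repr, clean_repr)

    return alpha * low_level + beta * high_level
```

Such a term would be added to the emotion-classification loss during training only; clean references are not needed at inference, which is consistent with the abstract's claim that clean-environment performance is not compromised.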
Related papers
- Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation [29.789809751108304]
We propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST).
Because the proposed method captures a noise-agnostic expressivity representation, it can generate high-quality speech even in noisy environments.
arXiv Detail & Related papers (2024-06-04T19:22:13Z)
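As a rough illustration of the distillation strategy in the entry above, a student fed noisy speech can be trained to reproduce a teacher's representation of the corresponding clean speech, yielding a noise-agnostic representation. The setup below (frozen teacher, MSE distance, function names) is an assumption, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def noise_robust_distillation_loss(student, teacher, clean_wav, noisy_wav):
    """Hypothetical sketch: distill a clean-speech representation
    into a student that only sees the noisy waveform."""
    with torch.no_grad():
        target = teacher(clean_wav)   # noise-agnostic reference representation
    pred = student(noisy_wav)         # student must match it from noisy input
    return F.mse_loss(pred, target)
```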
- Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR [35.710735895190844]
We propose a self-supervised framework named Wav2code to implement a feature-level SE with reduced distortions for noise-robust ASR.
During finetuning, a Transformer-based code predictor is used to accurately predict clean codes by modeling the global dependencies of the input noisy representations.
Experiments on both synthetic and real noisy datasets demonstrate that Wav2code can solve the speech distortion and improve ASR performance under various noisy conditions.
arXiv Detail & Related papers (2023-04-11T04:46:12Z)
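To make the "codebook lookup" step in the Wav2code entry above concrete, here is a minimal nearest-neighbour sketch in which noisy frame representations are replaced by their closest entries in a learned clean-speech codebook; shapes and names are assumptions.

```python
import torch

def codebook_lookup(features, codebook):
    """Hypothetical sketch of restoring clean representations by lookup.

    features: noisy frame representations, shape (B, T, D)
    codebook: learned clean-speech codebook, shape (K, D)
    """
    # Squared L2 distance from every frame to every code: (B, T, K)
    dists = (features.pow(2).sum(-1, keepdim=True)
             - 2 * features @ codebook.T
             + codebook.pow(2).sum(-1))
    codes = dists.argmin(-1)      # nearest clean code per frame, (B, T)
    restored = codebook[codes]    # quantized clean representations, (B, T, D)
    return restored, codes
```

During finetuning, the Transformer-based code predictor mentioned above would predict `codes` directly from the noisy input rather than relying on this hard nearest-neighbour search.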
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
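A perceptual loss of the kind the PL-EESR entry above alludes to is typically computed in the feature space of a fixed task network rather than on the waveform itself. The sketch below assumes a pretrained speaker network exposing a hypothetical `features(x, upto=idx)` hook and an arbitrary layer choice; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(enhanced, clean, speaker_net, layers=(2, 4, 6)):
    """Hypothetical sketch: match enhanced speech to clean speech in the
    intermediate feature space of a frozen speaker network."""
    loss = 0.0
    for idx in layers:
        # `speaker_net.features(x, upto=idx)` is an assumed hook returning
        # the activations of layer `idx` for input x.
        f_enh = speaker_net.features(enhanced, upto=idx)
        with torch.no_grad():
            f_cln = speaker_net.features(clean, upto=idx)
        loss = loss + F.l1_loss(f_enh, f_cln)
    return loss / len(layers)
```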
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims to increase the recognition rate and reduce false alarms in the presence of background noise.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over a conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
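The GRF summary above says noisy and enhanced features are combined dynamically; a stripped-down version of that idea is a learned per-dimension sigmoid gate. The paper's actual fusion is recurrent, so the module below is a simplification with assumed shapes.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal sketch of gated fusion of noisy and enhanced features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, noisy, enhanced):
        # The gate decides, frame by frame and dimension by dimension,
        # how much to trust the enhanced features over the noisy ones.
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        return g * enhanced + (1.0 - g) * noisy
```

For example, `GatedFusion(dim=80)` applied to two (batch, frames, 80) tensors returns a fused tensor of the same shape for the downstream recognizer.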
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net-based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments remains one of the most challenging and most in-demand tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition separately, the two modules are integrated into one framework and jointly optimised using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from contextual information in the time and frequency domains.
The results show that the proposed approach, using speech enhancement and multi-stage attention models, outperforms two strong baselines without them under most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)