Related papers: On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

URL: http://arxiv.org/abs/2311.07093v2
Date: Tue, 14 Nov 2023 13:09:51 GMT
Title: On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition
Authors: Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda
Abstract summary: We propose an efficient attempt to noisy speech emotion recognition (NSER). We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
Score: 26.013815255299342
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper proposes an efficient attempt to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream NSER task. Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
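
The pipeline described in the abstract (take an intermediate-layer representation from an ASR model, then use it for the downstream NSER task) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder below is a toy stack of random layers standing in for a pretrained ASR model, and the layer index, mean pooling over time, linear classifier, and all dimensions are assumptions chosen only for illustration, since the abstract does not specify them.

```python
import math
import random


def make_layer(dim_in, dim_out, rng):
    """Random weight matrix standing in for one pretrained layer (hypothetical)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim_in)] for _ in range(dim_out)]


def apply_layer(weights, frames):
    """Apply tanh(W x) to every frame; each frame is a list of floats."""
    return [
        [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
        for frame in frames
    ]


def intermediate_features(layers, frames, layer_index):
    """Run frames through the encoder and return the output of layers[layer_index]."""
    hidden = frames
    for layer in layers[: layer_index + 1]:
        hidden = apply_layer(layer, hidden)
    return hidden


def mean_pool(frames):
    """Average frame-level features over time into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def classify(pooled, head):
    """Linear head followed by argmax over emotion classes."""
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in head]
    return scores.index(max(scores))


# All sizes below are illustrative assumptions, not values from the paper.
FEAT_DIM, HIDDEN_DIM, NUM_LAYERS, NUM_EMOTIONS, NUM_FRAMES = 8, 16, 4, 4, 50

rng = random.Random(0)
encoder = [
    make_layer(FEAT_DIM if i == 0 else HIDDEN_DIM, HIDDEN_DIM, rng)
    for i in range(NUM_LAYERS)
]
head = make_layer(HIDDEN_DIM, NUM_EMOTIONS, rng)

# Stand-in for the acoustic frames of one noisy utterance.
utterance = [[rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(NUM_FRAMES)]

feats = intermediate_features(encoder, utterance, layer_index=2)  # intermediate layer
pooled = mean_pool(feats)          # utterance-level representation
pred = classify(pooled, head)      # index of the predicted emotion class
```

In a real system the toy encoder would be replaced by a pretrained ASR model whose weights stay fixed or are fine-tuned, and only the downstream head would need to be trained on emotion labels.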

Related papers

Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation [25.410770364140856]
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs). We introduce the notion of dynamic perturbation, which can inject controlled perturbations into the noise embeddings during inference.
arXiv Detail & Related papers (2024-09-03T02:29:01Z)
TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that the proposed TRNet substantially improves the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR). In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER. Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process. A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules. The errors of the ASR system can seriously downgrade the performance of the NLP modules. Previous work has shown it is effective to employ data augmentation methods to solve this problem by injecting ASR noise during the training process.
arXiv Detail & Related papers (2021-03-25T05:15:43Z)
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. The proposed method achieves the relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features. At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features. At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
