On the Effectiveness of ASR Representations in Real-world Noisy Speech
Emotion Recognition
- URL: http://arxiv.org/abs/2311.07093v2
- Date: Tue, 14 Nov 2023 13:09:51 GMT
- Title: On the Effectiveness of ASR Representations in Real-world Noisy Speech
Emotion Recognition
- Authors: Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda
- Abstract summary: We propose an efficient attempt to noisy speech emotion recognition (NSER)
We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech.
Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
- Score: 26.013815255299342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an efficient attempt to noisy speech emotion recognition
(NSER). Conventional NSER approaches have proven effective in mitigating the
impact of artificial noise sources, such as white Gaussian noise, but are
limited to non-stationary noises in real-world environments due to their
complexity and uncertainty. To overcome this limitation, we introduce a new
method for NSER by adopting the automatic speech recognition (ASR) model as a
noise-robust feature extractor to eliminate non-vocal information in noisy
speech. We first obtain intermediate layer information from the ASR model as a
feature representation for emotional speech and then apply this representation
for the downstream NSER task. Our experimental results show that 1) the
proposed method achieves better NSER performance compared with the conventional
noise reduction method, 2) outperforms self-supervised learning approaches, and
3) even outperforms text-based approaches using ASR transcription or the ground
truth transcription of noisy speech.
Related papers
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Speech Emotion Recognition (SER) is subject to ubiquitous environmental noise.
We introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge.
We show that TRNet substantially increases the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z) - Large Language Models are Efficient Learners of Noise-Robust Speech
Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR)
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Continuous Modeling of the Denoising Process for Speech Enhancement
Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z) - On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion
and Automatic Speech Recognition [6.006652562747009]
We investigate a joint ASR-SER learning approach in a low-resource setting.
Joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3% respectively.
Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
arXiv Detail & Related papers (2023-05-21T18:52:21Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules.
The errors of the ASR system can seriously downgrade the performance of the NLP modules.
Previous work has shown it is effective to employ data augmentation methods to solve this problem by injecting ASR noise during the training process.
arXiv Detail & Related papers (2021-03-25T05:15:43Z) - Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves the relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z) - Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z) - Improving noise robust automatic speech recognition with single-channel
time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
We show that single-channel noise reduction can still improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.