On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion
and Automatic Speech Recognition
- URL: http://arxiv.org/abs/2305.12540v2
- Date: Thu, 25 May 2023 19:33:54 GMT
- Title: On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion
and Automatic Speech Recognition
- Authors: Lokesh Bansal, S. Pavankumar Dubagunta, Malolan Chetlur, Pushpak
Jagtap, Aravind Ganapathiraju
- Abstract summary: We investigate a joint ASR-SER learning approach in a low-resource setting.
Joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively.
Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
- Score: 6.006652562747009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New-age conversational agent systems perform both speech emotion recognition
(SER) and automatic speech recognition (ASR) using two separate and often
independent approaches for real-world application in noisy environments. In
this paper, we investigate a joint ASR-SER multitask learning approach in a
low-resource setting and show that improvements are observed not only in SER,
but also in ASR. We also investigate the robustness of such jointly trained
models to the presence of background noise, babble, and music. Experimental
results on the IEMOCAP dataset show that joint learning can improve ASR word
error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively,
in clean scenarios. In noisy scenarios, results on data augmented with MUSAN
show that the joint approach outperforms the independent ASR and SER approaches
across many noisy conditions. Overall, the joint ASR-SER approach yielded more
noise-resistant models than the independent ASR and SER approaches.
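
To make the joint ASR-SER idea concrete, below is a minimal sketch of a multitask model: a shared encoder feeds a CTC head for ASR and a pooled classification head for SER, with a weighted sum of the two losses, plus a simple SNR-based mixer of the kind typically used with MUSAN noise, babble, and music. The BLSTM encoder, layer sizes, loss weight `alpha`, and the `mix_at_snr` helper are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointASRSER(nn.Module):
    """Shared speech encoder with two heads: CTC for ASR, utterance-level
    softmax for SER. The BLSTM encoder and sizes are assumptions."""
    def __init__(self, n_mels=80, hidden=256, vocab=32, n_emotions=4):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.asr_head = nn.Linear(2 * hidden, vocab)       # per-frame token logits
        self.ser_head = nn.Linear(2 * hidden, n_emotions)  # utterance-level logits

    def forward(self, feats):                     # feats: (B, T, n_mels)
        enc, _ = self.encoder(feats)              # (B, T, 2*hidden)
        return self.asr_head(enc), self.ser_head(enc.mean(dim=1))

def joint_loss(asr_logits, ser_logits, tokens, feat_lens, token_lens,
               emotion, alpha=0.5):
    """Weighted multitask objective: alpha * CTC + (1 - alpha) * CE."""
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)  # CTC wants (T, B, V)
    ctc = F.ctc_loss(log_probs, tokens, feat_lens, token_lens, blank=0)
    return alpha * ctc + (1 - alpha) * F.cross_entropy(ser_logits, emotion)

def mix_at_snr(speech, noise, snr_db):
    """Additive mixing at a target SNR, as commonly done with MUSAN noise,
    babble, or music clips (a common recipe, not necessarily the paper's)."""
    noise = noise[..., : speech.shape[-1]]
    scale = (speech.pow(2).mean() /
             (noise.pow(2).mean().clamp_min(1e-10) * 10 ** (snr_db / 10))).sqrt()
    return speech + scale * noise
```

During training, each batch would be noise-mixed, featurized, passed through the model, and optimized with `joint_loss`.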
Related papers
- Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance [7.882996636086014]
It is important that automatic speech recognition (ASR) models, and the ways they are used, are fair and equitable.
The study seeks to understand the factors underlying the racial disparity in performance by examining a current state-of-the-art neural-network-based ASR system.
arXiv Detail & Related papers (2024-07-19T02:14:17Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at multiple levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new state-of-the-art cpCER of 29.13% on the evaluated dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
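The following is a minimal sketch of the cross-attention fusion idea from the MLCA-AVSR entry above, assuming same-dimension audio and visual frame sequences: each modality attends to the other, with residual connections. The single fusion layer, dimensions, and head count are assumptions; the paper applies such fusion at multiple encoder levels.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One cross-attention fusion layer: audio frames attend to visual frames
    and vice versa. A generic sketch, not the MLCA-AVSR architecture itself."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):             # (B, Ta, D), (B, Tv, D)
        a, _ = self.a2v(audio, visual, visual)    # audio queries attend to video
        v, _ = self.v2a(visual, audio, audio)     # video queries attend to audio
        return a + audio, v + visual              # residual connections

# Stacking this module between encoder layers approximates fusion
# "at different levels" of the audio/visual encoders.
fusion = CrossModalFusion()
a, v = fusion(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
```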
- On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition [26.013815255299342]
We propose an efficient approach to noisy speech emotion recognition (NSER).
We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech.
Our experimental results show that 1) the proposed method achieves better NSER performance than the conventional noise-reduction method, 2) it outperforms self-supervised learning approaches, and 3) it even outperforms text-based approaches that use the ASR transcription or the ground-truth transcription of the noisy speech.
arXiv Detail & Related papers (2023-11-13T05:45:55Z)
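As a rough illustration of using an ASR model as a noise-robust feature extractor for NSER, the sketch below pools frame-level features from an ASR-fine-tuned wav2vec 2.0 (via torchaudio's pretrained pipelines) into a small emotion classifier; the specific bundle and the classifier head are assumptions, not the entry's exact setup.

```python
import torch
import torchaudio

# Load an ASR-fine-tuned wav2vec 2.0 as the feature extractor; the entry's
# actual ASR model may differ, so treat this bundle choice as an assumption.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr_model = bundle.get_model().eval()

classifier = torch.nn.Sequential(      # lightweight emotion head (illustrative)
    torch.nn.Linear(768, 128), torch.nn.ReLU(), torch.nn.Linear(128, 4))

waveform = torch.randn(1, 16000)       # stand-in for a 1 s noisy utterance at 16 kHz
with torch.no_grad():
    feats, _ = asr_model.extract_features(waveform)   # list of per-layer features
emotion_logits = classifier(feats[-1].mean(dim=1))    # pool last layer over time
```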
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, namely mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
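Complex spectral mapping, as used in the entry above, predicts the clean STFT's real and imaginary parts directly from the noisy STFT. The toy network below only shows that input/output contract; TF-GridNet itself is far more elaborate, and in the paper's pipeline the enhanced waveform then feeds a WavLM-based SSLR front-end and the ASR back-end.

```python
import torch
import torch.nn as nn

class ComplexSpectralMapper(nn.Module):
    """Toy complex spectral mapping: predict the clean real/imag STFT from the
    noisy STFT. Only the I/O contract matches the technique described above."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(2 * n_freq, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 2 * n_freq)   # real+imag per frequency

    def forward(self, noisy_stft):                # complex tensor (B, F, T)
        x = torch.cat([noisy_stft.real, noisy_stft.imag], dim=1)  # (B, 2F, T)
        h, _ = self.rnn(x.transpose(1, 2))
        out = self.proj(h).transpose(1, 2)
        real, imag = out.chunk(2, dim=1)
        return torch.complex(real, imag)           # estimated clean STFT

wave = torch.randn(1, 16000)
stft = torch.stft(wave, n_fft=512, return_complex=True)    # (1, 257, T)
clean_est = ComplexSpectralMapper()(stft)
wave_est = torch.istft(clean_est, n_fft=512)                # feed to SSLR/ASR back-end
```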
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative reduction in averaged SD-CER.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
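A rough sketch of the ASR-aware diarization idea from the entry above: frame-level ASR-derived features are fused with acoustic features (here by simple concatenation) before an EEND-style encoder that emits per-frame speaker-activity probabilities. The plain Transformer stand-in for the Conformer, the dimensions, and concatenation as the fusion mechanism are all assumptions.

```python
import torch
import torch.nn as nn

class ASRAwareEEND(nn.Module):
    """EEND-style diarizer consuming log-mel features concatenated with
    frame-level ASR features; the paper proposes three Conformer-based
    variants, of which this is only a generic approximation."""
    def __init__(self, n_mels=80, asr_dim=256, n_speakers=2):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=n_mels + asr_dim, nhead=4,
                                       batch_first=True),
            num_layers=2)
        self.head = nn.Linear(n_mels + asr_dim, n_speakers)

    def forward(self, mels, asr_feats):            # (B, T, 80), (B, T, 256)
        x = torch.cat([mels, asr_feats], dim=-1)   # feature-level fusion
        return torch.sigmoid(self.head(self.encoder(x)))  # per-frame activity
```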
- Fusing ASR Outputs in Joint Training for Speech Emotion Recognition [14.35400087127149]
We propose to fuse automatic speech recognition (ASR) outputs into the pipeline for jointly training speech emotion recognition (SER).
In joint ASR-SER training, incorporating both the ASR hidden states and the text output through a hierarchical co-attention fusion approach improves SER performance the most.
We also present a novel word error rate analysis on IEMOCAP and a layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
arXiv Detail & Related papers (2021-10-29T11:21:17Z)
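To illustrate the co-attention fusion from the entry above: ASR hidden states attend to embeddings of the ASR text output and vice versa, and the pooled results feed an emotion classifier. The single fusion stage and all sizes are assumptions; the paper's hierarchical variant fuses at multiple stages.

```python
import torch
import torch.nn as nn

class CoAttentionSER(nn.Module):
    """Co-attention between ASR hidden states and embeddings of the ASR text
    output, followed by an emotion classifier. A rough single-stage sketch."""
    def __init__(self, dim=256, vocab=1000, n_emotions=4):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.audio_to_text = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_emotions)

    def forward(self, asr_hidden, token_ids):      # (B, T, D), (B, L)
        text = self.text_emb(token_ids)
        a, _ = self.audio_to_text(asr_hidden, text, text)
        t, _ = self.text_to_audio(text, asr_hidden, asr_hidden)
        pooled = torch.cat([a.mean(1), t.mean(1)], dim=-1)
        return self.classifier(pooled)
```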
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
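The combination of contrastive representation learning with speech reconstruction can be sketched as a wav2vec-2.0-style InfoNCE term plus an L1 reconstruction of the clean waveform from representations of the noisy input, which pushes the encoder toward noise-invariant features. Here `encoder`, `decoder`, the negative-sampling scheme, and the weight `beta` are placeholders, not the framework's actual components.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred, target, negatives, temp=0.1):
    """InfoNCE over one positive and K negatives per frame (wav2vec-2.0 style)."""
    pos = F.cosine_similarity(pred, target, dim=-1) / temp                  # (B, T)
    neg = F.cosine_similarity(pred.unsqueeze(2), negatives, dim=-1) / temp  # (B, T, K)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1)
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long)   # positive at index 0
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

def total_loss(encoder, decoder, noisy_wave, clean_wave, target, negatives, beta=1.0):
    # Auxiliary reconstruction: decode the *clean* waveform from representations
    # of the *noisy* input, encouraging noise-invariant representations.
    z = encoder(noisy_wave)
    recon = decoder(z)
    return contrastive_loss(z, target, negatives) + beta * F.l1_loss(recon, clean_wave)
```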
- ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding [42.80343041535763]
The robustness of natural language understanding systems to errors introduced by automatic speech recognition (ASR) is under-examined.
We propose the ASR-GLUE benchmark, a new collection of six NLU tasks for evaluating model performance under ASR errors.
arXiv Detail & Related papers (2021-08-30T08:11:39Z)
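The weight-sharing idea in Dual-mode ASR can be sketched as one self-attention encoder run with a causal mask for streaming and without a mask for full-context, summing the two losses during joint training; the generic Transformer encoder and sizes below are assumptions.

```python
import torch
import torch.nn as nn

class DualModeEncoder(nn.Module):
    """One encoder serving both modes: streaming applies a causal attention
    mask, full-context applies none. Weight sharing is the dual-mode idea;
    the layer type and sizes here are illustrative."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)

    def forward(self, x, streaming: bool):
        mask = None
        if streaming:  # frame t may only attend to frames <= t
            T = x.size(1)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.encoder(x, mask=mask)

model = DualModeEncoder()
x = torch.randn(2, 50, 256)
# Joint training: run both modes on the same batch, then sum the two ASR losses.
full_out, stream_out = model(x, streaming=False), model(x, streaming=True)
```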
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance, demonstrating that single-channel noise reduction can still benefit ASR.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
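Finally, a toy illustration of a single-channel time-domain enhancement front-end in the spirit of the last entry: a learned 1-D convolutional encoder, a sigmoid mask, and a transposed-convolution decoder operating directly on the raw waveform. Real systems are much larger; this only shows the enhance-then-recognize flow, and every size here is an assumption.

```python
import torch
import torch.nn as nn

class TimeDomainDenoiser(nn.Module):
    """Toy time-domain denoiser: learned encoder, masking, learned decoder,
    all on the raw waveform. Only illustrates the pipeline shape."""
    def __init__(self, channels=64, kernel=16, stride=8):
        super().__init__()
        self.enc = nn.Conv1d(1, channels, kernel, stride=stride)
        self.mask = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.dec = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, wave):                  # (B, 1, samples)
        feats = torch.relu(self.enc(wave))
        return self.dec(feats * self.mask(feats))

noisy = torch.randn(1, 1, 16000)
enhanced = TimeDomainDenoiser()(noisy)        # pass to any ASR system afterwards
```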
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.