Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2110.15684v1
- Date: Fri, 29 Oct 2021 11:21:17 GMT
- Title: Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
- Authors: Yuanchao Li, Peter Bell, Catherine Lai
- Abstract summary: We propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for jointly training Speech Emotion Recognition (SER).
In joint ASR-SER training, incorporating both the ASR hidden and text outputs using a hierarchical co-attention fusion approach improves SER performance the most.
We also present a novel word error rate analysis on IEMOCAP and a layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
- Score: 14.35400087127149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alongside acoustic information, linguistic features based on speech
transcripts have been proven useful in Speech Emotion Recognition (SER).
However, due to the scarcity of emotion labelled data and the difficulty of
recognizing emotional speech, it is hard to obtain reliable linguistic features
and models in this research area. In this paper, we propose to fuse Automatic
Speech Recognition (ASR) outputs into the pipeline for jointly training SER. The
relationship between ASR and SER is understudied, and it is unclear what and
how ASR features benefit SER. By examining various ASR outputs and fusion
methods, our experiments show that in joint ASR-SER training, incorporating
both ASR hidden and text output using a hierarchical co-attention fusion
approach improves the SER performance the most. On the IEMOCAP corpus, our
approach achieves 63.4% weighted accuracy, which is close to the baseline
results achieved by combining ground-truth transcripts. In addition, we present
a novel word error rate analysis on IEMOCAP and a layer-difference analysis
of the Wav2vec 2.0 model to better understand the relationship between ASR and
SER.
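Neither the full architecture nor code is given in this summary, so the following is only a minimal sketch of the general idea of co-attention fusion between ASR hidden states and transcript (text-output) embeddings for utterance-level SER. All module names, dimensions, and the four-class setup are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch (not the paper's released code): co-attention fusion of
# ASR encoder hidden states and ASR text-output embeddings for SER.
import torch
import torch.nn as nn


class CoAttentionFusionSER(nn.Module):
    def __init__(self, acoustic_dim=768, text_dim=768, hidden_dim=256, num_classes=4):
        super().__init__()
        # Project both modalities into a shared space.
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Cross-attention in both directions (acoustic -> text, text -> acoustic).
        self.attn_a2t = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.attn_t2a = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, asr_hidden, text_emb):
        # asr_hidden: (B, T_audio, acoustic_dim), e.g. Wav2vec 2.0 / ASR encoder states
        # text_emb:   (B, T_text, text_dim), e.g. embeddings of the ASR transcript
        a = self.acoustic_proj(asr_hidden)
        t = self.text_proj(text_emb)
        # Each modality attends to the other ("co-attention").
        a_att, _ = self.attn_a2t(query=a, key=t, value=t)
        t_att, _ = self.attn_t2a(query=t, key=a, value=a)
        # Mean-pool over time and concatenate for the emotion classifier.
        fused = torch.cat([a_att.mean(dim=1), t_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Example with dummy features: 2 utterances, 300 acoustic frames, 40 tokens.
logits = CoAttentionFusionSER()(torch.randn(2, 300, 768), torch.randn(2, 40, 768))
print(logits.shape)  # torch.Size([2, 4])
```
A hierarchical variant, as the term in the abstract suggests, could apply such cross-attention first over lower-level units (frames and tokens) and again over higher-level segment summaries before pooling.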
Related papers
- Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results than the best-performing ASR transcript (a generic gated-fusion sketch follows this entry).
arXiv Detail & Related papers (2024-06-12T15:59:25Z)
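How the modality-gated fusion in the study above is realised is not described in this summary; one common reading is a learned sigmoid gate that weights the acoustic view against the (possibly error-prone) ASR-text view. The sketch below illustrates only that generic idea; the class name and dimensions are hypothetical.
```python
# Generic sigmoid-gated fusion of pooled acoustic and text embeddings
# (an illustrative interpretation, not the cited paper's exact design).
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # per-dimension gate computed from both views

    def forward(self, acoustic_vec, text_vec):
        # acoustic_vec, text_vec: (B, dim) pooled utterance-level embeddings
        g = torch.sigmoid(self.gate(torch.cat([acoustic_vec, text_vec], dim=-1)))
        # g near 1 favours the acoustic view; g near 0 favours the ASR-text view.
        return g * acoustic_vec + (1.0 - g) * text_vec


fused = GatedFusion()(torch.randn(2, 256), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256])
```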
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase the highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction [23.812838405442953]
We introduce a novel multi-modal fusion method to learn shared representations across modalities.
Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1%.
arXiv Detail & Related papers (2024-01-24T06:55:55Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition [12.437708240244756]
We analyze how Automatic Speech Recognition (ASR) performs on emotional speech by examining ASR performance on emotion corpora.
We conduct text-based Speech Emotion Recognition on ASR transcripts with increasing word error rates to investigate how ASR affects SER (a small WER-computation example follows this entry).
arXiv Detail & Related papers (2023-05-25T13:56:09Z)
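For readers unfamiliar with the metric studied above, word error rate can be computed with the open-source jiwer package; the reference and hypothesis strings below are invented examples.
```python
# Word error rate between a reference transcript and an ASR hypothesis,
# using the jiwer package (pip install jiwer). The strings are made up.
import jiwer

reference = "i am feeling really happy about this"
hypothesis = "i am feeling really happy bout these"

# Two substitutions over seven reference words -> WER of about 0.29.
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")
```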
- On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition [6.006652562747009]
We investigate a joint ASR-SER learning approach in a low-resource setting.
Joint learning can improve the ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively.
Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches (a minimal joint-loss sketch follows this entry).
arXiv Detail & Related papers (2023-05-21T18:52:21Z)
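A common way to realise the joint ASR-SER learning described in the entry above is a weighted multi-task objective over a shared encoder. The sketch below assumes a CTC loss for the ASR head and cross-entropy for the SER head; the weighting and tensor shapes are illustrative, not taken from the paper.
```python
# Illustrative multi-task objective for joint ASR-SER training:
# total = alpha * CTC loss (ASR head) + (1 - alpha) * cross-entropy (SER head).
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ser_loss = nn.CrossEntropyLoss()
alpha = 0.7  # hypothetical task weighting


def joint_loss(asr_log_probs, targets, input_lengths, target_lengths,
               ser_logits, emotion_labels):
    # asr_log_probs: (T, B, vocab) log-probabilities from the ASR head
    # ser_logits:    (B, num_emotions) from the SER head on pooled encoder states
    l_asr = ctc_loss(asr_log_probs, targets, input_lengths, target_lengths)
    l_ser = ser_loss(ser_logits, emotion_labels)
    return alpha * l_asr + (1.0 - alpha) * l_ser


# Dummy example: 50 frames, batch of 2, vocab of 30, 4 emotion classes.
loss = joint_loss(
    torch.randn(50, 2, 30).log_softmax(-1),
    torch.randint(1, 30, (2, 20)),
    torch.full((2,), 50, dtype=torch.long),
    torch.full((2,), 20, dtype=torch.long),
    torch.randn(2, 4),
    torch.randint(0, 4, (2,)),
)
print(loss.item())
```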
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow.
We propose CADNet, a novel contextualized attention-based distillation approach.
We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)