Related papers: Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

URL: http://arxiv.org/abs/2406.08353v2
Date: Fri, 13 Sep 2024 01:48:15 GMT
Title: Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Authors: Yuanchao Li, Peter Bell, Catherine Lai,
Abstract summary: This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora. We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript.
Score: 17.166092544686553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. These findings provide insights into SER with ASR assistance, especially for real-world applications.

Related papers

Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC) We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon. We propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality.
arXiv Detail & Related papers (2024-05-26T19:58:38Z)
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction [23.812838405442953]
We introduce a novel multi-modal fusion method to learn shared representations across modalities. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1%.
arXiv Detail & Related papers (2024-01-24T06:55:55Z)
Boosting Punctuation Restoration with Data Generation and Reinforcement Learning [70.26450819702728]
Punctuation restoration is an important task in automatic speech recognition (ASR) The discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap.
arXiv Detail & Related papers (2023-07-24T17:22:04Z)
ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition [12.437708240244756]
We analyze how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpora. We conduct text-based Speech Emotion Recognition on ASR transcripts with increasing word error rates to investigate how ASR affects SER.
arXiv Detail & Related papers (2023-05-25T13:56:09Z)
RED-ACE: Robust Error Detection for ASR using Confidence Embeddings [5.4693121539705984]
We propose to utilize the ASR system's word-level confidence scores for improving AED performance. We add an ASR Confidence Embedding layer to the AED model's encoder, allowing us to jointly encode the confidence scores and the transcribed text into a contextualized representation.
arXiv Detail & Related papers (2022-03-14T15:13:52Z)
Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS) ASR errors directly affect the quality of the output summary in the cascade approach. We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
Fusing ASR Outputs in Joint Training for Speech Emotion Recognition [14.35400087127149]
We propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training Speech Emotion Recognition (SER) In joint ASR-SER training, incorporating both ASR hidden and text output using a hierarchical co-attention fusion approach improves the SER performance the most. We also present novel word error rate analysis on IEMOCAP and layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
arXiv Detail & Related papers (2021-10-29T11:21:17Z)
Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow. We propose CADNet, a novel contextualized attention-based distillation approach. We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR) APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU) We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.