ASR and Emotional Speech: A Word-Level Investigation of the Mutual
Impact of Speech and Emotion Recognition
- URL: http://arxiv.org/abs/2305.16065v2
- Date: Sun, 28 May 2023 17:26:58 GMT
- Title: ASR and Emotional Speech: A Word-Level Investigation of the Mutual
Impact of Speech and Emotion Recognition
- Authors: Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai
- Abstract summary: We analyze how Automatic Speech Recognition (ASR) performs on emotional speech by evaluating ASR performance on emotion corpora.
We conduct text-based Speech Emotion Recognition on ASR transcripts with increasing word error rates to investigate how ASR affects SER.
- Score: 12.437708240244756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Speech Emotion Recognition (SER), textual data is often used alongside
audio signals to address their inherent variability. However, the reliance on
human-annotated text in most research hinders the development of practical SER
systems. To overcome this challenge, we investigate how Automatic Speech
Recognition (ASR) performs on emotional speech by analyzing the ASR performance
on emotion corpora and examining the distribution of word errors and confidence
scores in ASR transcripts to gain insight into how emotion affects ASR. We
utilize four ASR systems, namely Kaldi ASR, wav2vec2, Conformer, and Whisper,
and three corpora (IEMOCAP, MOSI, and MELD) to ensure generalizability.
Additionally, we conduct text-based SER on ASR transcripts with increasing word
error rates to investigate how ASR affects SER. The objective of this study is
to uncover the relationship and mutual impact of ASR and SER, in order to
facilitate ASR adaptation to emotional speech and the use of SER in real-world applications.
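As a rough illustration of the word-level analysis described above (not the authors' released code), the sketch below transcribes emotion-labelled utterances with Whisper, one of the four ASR systems studied, and aggregates word error rate per emotion label with jiwer. The file names, reference transcripts, and labels are hypothetical placeholders, and text normalization is deliberately minimal.

```python
# Rough sketch of per-emotion WER analysis, assuming a list of emotion-labelled
# utterances with human reference transcripts. Paths, texts, and labels are
# hypothetical placeholders, not the actual corpus layout.
import whisper          # openai-whisper, one of the four ASR systems studied
from jiwer import wer   # standard word error rate implementation

utterances = [
    ("ses01_angry_001.wav", "why did you do that", "angry"),
    ("ses01_happy_002.wav", "that is wonderful news", "happy"),
]

model = whisper.load_model("base")

errors_by_emotion = {}
for audio_path, reference, emotion in utterances:
    # Transcribe and lightly normalize the hypothesis before scoring.
    hypothesis = model.transcribe(audio_path)["text"].lower().strip()
    errors_by_emotion.setdefault(emotion, []).append(wer(reference, hypothesis))

for emotion, scores in errors_by_emotion.items():
    print(f"{emotion}: mean WER = {sum(scores) / len(scores):.3f}")
```

Comparing the resulting per-emotion WER distributions across corpora and ASR systems is the kind of evidence the study uses to show how emotion affects recognition accuracy.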
Related papers
- Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript.
arXiv Detail & Related papers (2024-06-12T15:59:25Z)
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Hey ASR System! Why Aren't You More Inclusive? Automatic Speech Recognition Systems' Bias and Proposed Bias Mitigation Techniques. A Literature Review [0.0]
We survey research that addresses ASR biases related to gender, race, sickness, and disability.
We also discuss techniques for designing a more accessible and inclusive ASR technology.
arXiv Detail & Related papers (2022-11-17T13:15:58Z)
- Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent? [3.7311680121118345]
We propose a new decoder biasing technique to incorporate the visual context while ensuring the ASR output does not degrade for incorrect context.
We achieve a 59% relative reduction in WER compared to an unmodified ASR system.
arXiv Detail & Related papers (2022-10-21T11:16:05Z)
- Fusing ASR Outputs in Joint Training for Speech Emotion Recognition [14.35400087127149]
We propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for jointly training Speech Emotion Recognition (SER).
In joint ASR-SER training, incorporating both ASR hidden states and text output using a hierarchical co-attention fusion approach improves SER performance the most (see the illustrative fusion sketch after this list).
We also present novel word error rate analysis on IEMOCAP and layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
arXiv Detail & Related papers (2021-10-29T11:21:17Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless speaker verification (SV).
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Quantifying Bias in Automatic Speech Recognition [28.301997555189462]
This paper quantifies the bias of a Dutch SotA ASR system against gender, age, regional accents and non-native accents.
Based on our findings, we suggest bias mitigation strategies for ASR development.
arXiv Detail & Related papers (2021-03-28T12:52:03Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
- Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow.
We propose CADNet, a novel contextualized attention-based distillation approach.
We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
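For the joint ASR-SER fusion entry above (Fusing ASR Outputs in Joint Training for Speech Emotion Recognition), the following is a minimal sketch of what cross-attention-style fusion of ASR hidden states and text embeddings could look like. The dimensions, mean-pooling, and 4-class head are placeholder assumptions for illustration, not that paper's exact hierarchical co-attention architecture.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of ASR hidden states and text
    embeddings for SER. Dimensions, pooling, and the 4-class head are
    placeholder assumptions, not the cited paper's exact architecture."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_emotions: int = 4):
        super().__init__()
        self.audio_attends_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attends_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, asr_hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # asr_hidden: (batch, audio_frames, dim); text_emb: (batch, tokens, dim)
        a, _ = self.audio_attends_text(asr_hidden, text_emb, text_emb)
        t, _ = self.text_attends_audio(text_emb, asr_hidden, asr_hidden)
        # Mean-pool each attended sequence, concatenate, and classify.
        fused = torch.cat([a.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example with random features standing in for real ASR encoder states and
# transcript embeddings.
logits = CoAttentionFusion()(torch.randn(2, 50, 256), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 4])
```

In practice, asr_hidden would come from the ASR encoder and text_emb from embeddings of the (possibly erroneous) ASR transcript rather than human-annotated text, which is exactly the setting whose robustness the main paper investigates.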