Empirical Interpretation of the Relationship Between Speech Acoustic
Context and Emotion Recognition
- URL: http://arxiv.org/abs/2306.17500v1
- Date: Fri, 30 Jun 2023 09:21:48 GMT
- Title: Empirical Interpretation of the Relationship Between Speech Acoustic
Context and Emotion Recognition
- Authors: Anna Ollerenshaw, Md Asif Jalal, Rosanna Milner, Thomas Hain
- Abstract summary: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech.
In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration.
This research explores the implications of acoustic context and phone boundaries on local markers for SER using an attention-based approach.
- Score: 28.114873457383354
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech emotion recognition (SER) is vital for obtaining emotional
intelligence and understanding the contextual meaning of speech. Variations of
consonant-vowel (CV) phonemic boundaries can enrich acoustic context with
linguistic cues, which impacts SER. In practice, speech emotions are treated as
single labels over an acoustic segment for a given time duration. However,
phone boundaries within speech are not discrete events; therefore, the perceived
emotional state should also be distributed over potentially continuous
time windows.
This research explores the implications of acoustic context and phone
boundaries on local markers for SER using an attention-based approach. The
benefits of using a distributed approach to speech emotion understanding are
supported by the results of cross-corpora analysis experiments. In these
experiments, phones and words are mapped to the attention vectors, along with
the fundamental frequency, to observe the overlapping distributions and thereby
the relationship between acoustic context and emotion. This work aims to bridge
psycholinguistic theory research with computational modelling for SER.
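As a concrete illustration of the analysis described in the abstract, the sketch below aggregates frame-level attention weights over phone boundaries and pairs them with the fundamental frequency (F0) contour. It is a minimal sketch under stated assumptions: the attention weights and a forced-alignment phone segmentation are taken as given, the frame rate is 10 ms, and `librosa.pyin` is used for F0; none of this is the authors' released code.

```python
# Illustrative sketch (not the authors' code): align frame-level attention
# weights from an SER model with phone boundaries and the F0 contour.
import numpy as np
import librosa

SR = 16000   # assumed sample rate
HOP = 160    # 10 ms hop, assumed to match the SER model's frame rate

def phone_attention_profile(wav_path, attn, phone_segments):
    """attn: (n_frames,) attention weights from the SER model (given).
    phone_segments: list of (phone_label, start_sec, end_sec) from a
    forced aligner (given)."""
    y, _ = librosa.load(wav_path, sr=SR)
    # F0 on the same 10 ms grid as the attention weights.
    f0, _, _ = librosa.pyin(y, fmin=50, fmax=400, sr=SR, hop_length=HOP)
    rows = []
    for label, start, end in phone_segments:
        lo, hi = int(start * SR / HOP), int(end * SR / HOP)
        seg_f0 = f0[lo:hi]
        rows.append({
            "phone": label,
            # attention mass the model places on this phone
            "attn_mass": float(np.sum(attn[lo:hi])),
            # mean F0 over voiced frames within the phone, if any
            "mean_f0": float(np.nanmean(seg_f0))
                       if np.any(~np.isnan(seg_f0)) else None,
        })
    return rows
```

Plotting `attn_mass` against `mean_f0` per phone class is one way to inspect the overlapping distributions the abstract refers to.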
Related papers
- Exploiting Emotion-Semantic Correlations for Empathetic Response Generation [18.284296904390143]
Empathetic response generation aims to generate empathetic responses by understanding the speaker's emotional feelings from the language of dialogue.
Recent methods capture emotional words in the language of communicators and construct them as static vectors to perceive nuanced emotions.
We propose a dynamical Emotion-Semantic Correlation Model (ESCM) for empathetic dialogue generation tasks.
arXiv Detail & Related papers (2024-02-27T11:50:05Z)
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z)
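As an illustration of the contrastive pretraining idea in the entry above, here is a minimal sketch of a generic NT-Xent loss over two augmented views of the same utterances, applied to embeddings from a speaker-embedding backbone. This is a common formulation for such pretraining, not necessarily the paper's exact objective; all names are illustrative.

```python
# Hedged sketch: generic NT-Xent contrastive loss for pretraining on
# emotion-unlabeled speech; not the paper's exact objective.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the
    same utterances."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2B, dim)
    sim = z @ z.t() / temperature                     # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # exclude self-pairs
    # The positive for view i in z1 is view i in z2, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets.to(z.device))
```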
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotional components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations [4.297070083645049]
This paper presents a multi-scale conversational context learning approach for speech emotion recognition.
We investigated this approach on both speech transcriptions and acoustic segments.
According to our tests, the context derived from previous tokens has a more significant influence on accurate prediction than the context derived from following tokens.
arXiv Detail & Related papers (2023-08-28T20:31:45Z)
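The multi-scale context idea in the entry above can be sketched as follows: prepend a variable number of previous conversation turns to the target utterance before classification. The tokenizer choice (`bert-base-uncased`) and the `[SEP]`-joining scheme are assumptions for illustration, not the authors' setup.

```python
# Sketch: build conversational context at different scales by prepending
# the previous k turns to the utterance being classified.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed

def build_context_input(turns, idx, context_size):
    """turns: list of utterance transcriptions in conversation order;
    idx: index of the target utterance; context_size: previous turns kept."""
    context = turns[max(0, idx - context_size):idx]
    # [SEP] lets the encoder distinguish past turns from the target.
    text = " [SEP] ".join(context + [turns[idx]])
    return tokenizer(text, truncation=True, max_length=512,
                     return_tensors="pt")

# Varying context_size (e.g., 0, 1, 3, 5) probes the effect the entry
# reports: preceding context matters more than following context.
```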
- deep learning of segment-level feature representation for speech emotion recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models contextual-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
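A minimal sketch of the pipeline in the entry above: precomputed VGGish segment embeddings (128-dimensional) fed to an attentive bi-directional GRU with a classification head. Hidden sizes and the number of emotion classes are assumptions for illustration.

```python
# Sketch of the described segment-level pipeline: VGGish embeddings ->
# attentive bi-directional GRU -> emotion classifier.
import torch
import torch.nn as nn

class AttentiveGRUSER(nn.Module):
    def __init__(self, n_emotions=4, vggish_dim=128, hidden=128):
        super().__init__()
        self.gru = nn.GRU(vggish_dim, hidden, bidirectional=True,
                          batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)   # one score per segment
        self.out = nn.Linear(2 * hidden, n_emotions)

    def forward(self, segments):
        """segments: (batch, n_segments, 128) precomputed VGGish embeddings."""
        h, _ = self.gru(segments)                 # (B, T, 2H)
        w = torch.softmax(self.attn(h), dim=1)    # attention over segments
        pooled = (w * h).sum(dim=1)               # attention-weighted sum
        return self.out(pooled)                   # emotion logits
```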
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
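The run-time control in the entry above can be illustrated with a small sketch: an emotion attribute vector whose normalized weights mix learned per-emotion embeddings. The emotion inventory and function names are hypothetical, not taken from the paper.

```python
# Sketch: mix learned emotion embeddings with a manually defined
# attribute vector; names and the emotion set are illustrative.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]  # assumed set

def mix_emotion_embedding(embeddings, attribute):
    """embeddings: dict emotion -> (dim,) learned embedding.
    attribute: dict emotion -> weight; normalized to sum to 1."""
    w = np.array([attribute.get(e, 0.0) for e in EMOTIONS], dtype=float)
    w = w / w.sum()
    return sum(wi * embeddings[e]
               for wi, e in zip(w, EMOTIONS) if wi > 0)

# e.g., a 70/30 happy-surprise mixture:
# emb = mix_emotion_embedding(embs, {"happy": 0.7, "surprise": 0.3})
```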
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
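One simple way to realize the intensity control described in the entry above is to interpolate in the continuous style space between a neutral prototype and an emotion prototype; this is a hedged sketch, not the paper's actual mechanism.

```python
# Hedged sketch: intensity as linear interpolation between style
# prototypes; the paper's disentanglement model is not reproduced here.
import numpy as np

def intensity_controlled_style(neutral_proto, emotion_proto, intensity):
    """Blend from the neutral prototype (intensity 0.0) toward the
    emotion prototype (intensity 1.0) in the continuous style space."""
    t = float(np.clip(intensity, 0.0, 1.0))
    return (1.0 - t) * neutral_proto + t * emotion_proto

# style = intensity_controlled_style(neutral, angry, 0.5)  # "mildly angry"
```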
- Detecting Emotion Carriers by Combining Acoustic and Lexical Representations [7.225325393598648]
We focus on Emotion Carriers (ECs), defined as the segments that best explain the emotional state of the narrator.
ECs can provide a richer representation of the user state to improve natural language understanding.
We leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives.
arXiv Detail & Related papers (2021-12-13T12:39:53Z)
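The early and late fusion strategies in the entry above can be sketched as follows, with assumed feature sizes (88-dimensional eGeMAPS-style acoustic features, 768-dimensional BERT-style text embeddings) and a binary EC/non-EC decision per word; the classifiers are deliberately simple placeholders.

```python
# Sketch of word-level early vs. late fusion for Emotion Carrier detection;
# dimensions and classifiers are illustrative assumptions.
import torch
import torch.nn as nn

class ECFusion(nn.Module):
    def __init__(self, acoustic_dim=88, text_dim=768, mode="early"):
        super().__init__()
        self.mode = mode
        if mode == "early":   # concatenate features, single classifier
            self.clf = nn.Linear(acoustic_dim + text_dim, 2)
        else:                 # "late": per-modality classifiers, averaged
            self.clf_a = nn.Linear(acoustic_dim, 2)
            self.clf_t = nn.Linear(text_dim, 2)

    def forward(self, acoustic, text):
        """acoustic: (n_words, 88); text: (n_words, 768); returns EC logits."""
        if self.mode == "early":
            return self.clf(torch.cat([acoustic, text], dim=-1))
        return (self.clf_a(acoustic) + self.clf_t(text)) / 2
```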
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN)
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)