Empirical Interpretation of the Relationship Between Speech Acoustic
Context and Emotion Recognition
- URL: http://arxiv.org/abs/2306.17500v1
- Date: Fri, 30 Jun 2023 09:21:48 GMT
- Title: Empirical Interpretation of the Relationship Between Speech Acoustic
Context and Emotion Recognition
- Authors: Anna Ollerenshaw, Md Asif Jalal, Rosanna Milner, Thomas Hain
- Abstract summary: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech.
In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration.
This research explores the implications of acoustic context and phone boundaries on local markers for SER using an attention-based approach.
- Score: 28.114873457383354
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech emotion recognition (SER) is vital for obtaining emotional
intelligence and understanding the contextual meaning of speech. Variations of
consonant-vowel (CV) phonemic boundaries can enrich acoustic context with
linguistic cues, which impacts SER. In practice, speech emotions are treated as
single labels over an acoustic segment for a given time duration. However,
phone boundaries within speech are not discrete events; therefore, the perceived
emotional state should also be distributed over potentially continuous
time windows.
This research explores the implications of acoustic context and phone
boundaries on local markers for SER using an attention-based approach. The
benefits of using a distributed approach to speech emotion understanding are
supported by the results of cross-corpora analysis experiments. In these
experiments, phones and words are mapped to the attention vectors, along with
the fundamental frequency, to observe the overlapping distributions and thereby
the relationship between acoustic context and emotion. This work aims to bridge
psycholinguistic theory research with computational modelling for SER.
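As a concrete illustration of the analysis described in the abstract, the sketch below aggregates frame-level attention weights over phone boundaries and pairs them with the fundamental frequency (F0) contour. It is a minimal sketch under stated assumptions: the attention weights and a forced-alignment phone segmentation are taken as given, the frame rate is 10 ms, and `librosa.pyin` is used for F0; none of this is the authors' released code.

```python
# Illustrative sketch (not the authors' code): align frame-level attention
# weights from an SER model with phone boundaries and the F0 contour.
import numpy as np
import librosa

SR = 16000   # assumed sample rate
HOP = 160    # 10 ms hop, assumed to match the SER model's frame rate

def phone_attention_profile(wav_path, attn, phone_segments):
    """attn: (n_frames,) attention weights from the SER model (given).
    phone_segments: list of (phone_label, start_sec, end_sec) from a
    forced aligner (given)."""
    y, _ = librosa.load(wav_path, sr=SR)
    # F0 on the same 10 ms grid as the attention weights.
    f0, _, _ = librosa.pyin(y, fmin=50, fmax=400, sr=SR, hop_length=HOP)
    rows = []
    for label, start, end in phone_segments:
        lo, hi = int(start * SR / HOP), int(end * SR / HOP)
        seg_f0 = f0[lo:hi]
        rows.append({
            "phone": label,
            # attention mass the model places on this phone
            "attn_mass": float(np.sum(attn[lo:hi])),
            # mean F0 over voiced frames within the phone, if any
            "mean_f0": float(np.nanmean(seg_f0))
                       if np.any(~np.isnan(seg_f0)) else None,
        })
    return rows
```

Plotting `attn_mass` against `mean_f0` per phone class is one way to inspect the overlapping distributions the abstract refers to.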
Related papers
- Exploiting Emotion-Semantic Correlations for Empathetic Response Generation [18.284296904390143]
Empathetic response generation aims to generate empathetic responses by understanding the speaker's emotional feelings from the language of dialogue.
Recent methods capture emotional words in the language of communicators and construct them as static vectors to perceive nuanced emotions.
We propose a dynamical Emotion-Semantic Correlation Model (ESCM) for empathetic dialogue generation tasks.
arXiv Detail & Related papers (2024-02-27T11:50:05Z)
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z)
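As an illustration of the contrastive pretraining idea in the entry above, here is a minimal sketch of a generic NT-Xent loss over two augmented views of the same utterances, applied to embeddings from a speaker-embedding backbone. This is a common formulation for such pretraining, not necessarily the paper's exact objective; all names are illustrative.

```python
# Hedged sketch: generic NT-Xent contrastive loss for pretraining on
# emotion-unlabeled speech; not the paper's exact objective.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the
    same utterances."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2B, dim)
    sim = z @ z.t() / temperature                     # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # exclude self-pairs
    # The positive for view i in z1 is view i in z2, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets.to(z.device))
```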
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotional components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations [4.297070083645049]
This paper presents a multi-scale conversational context learning approach for speech emotion recognition.
We investigated this approach on both speech transcriptions and acoustic segments.
According to our tests, the context derived from previous tokens has a more significant influence on accurate prediction than the context derived from following tokens.
arXiv Detail & Related papers (2023-08-28T20:31:45Z)
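The multi-scale context idea in the entry above can be sketched as follows: prepend a variable number of previous conversation turns to the target utterance before classification. The tokenizer choice (`bert-base-uncased`) and the `[SEP]`-joining scheme are assumptions for illustration, not the authors' setup.

```python
# Sketch: build conversational context at different scales by prepending
# the previous k turns to the utterance being classified.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed

def build_context_input(turns, idx, context_size):
    """turns: list of utterance transcriptions in conversation order;
    idx: index of the target utterance; context_size: previous turns kept."""
    context = turns[max(0, idx - context_size):idx]
    # [SEP] lets the encoder distinguish past turns from the target.
    text = " [SEP] ".join(context + [turns[idx]])
    return tokenizer(text, truncation=True, max_length=512,
                     return_tensors="pt")

# Varying context_size (e.g., 0, 1, 3, 5) probes the effect the entry
# reports: preceding context matters more than following context.
```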
- deep learning of segment-level feature representation for speech emotion recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models contextual-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
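A minimal sketch of the pipeline in the entry above: precomputed VGGish segment embeddings (128-dimensional) fed to an attentive bi-directional GRU with a classification head. Hidden sizes and the number of emotion classes are assumptions for illustration.

```python
# Sketch of the described segment-level pipeline: VGGish embeddings ->
# attentive bi-directional GRU -> emotion classifier.
import torch
import torch.nn as nn

class AttentiveGRUSER(nn.Module):
    def __init__(self, n_emotions=4, vggish_dim=128, hidden=128):
        super().__init__()
        self.gru = nn.GRU(vggish_dim, hidden, bidirectional=True,
                          batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)   # one score per segment
        self.out = nn.Linear(2 * hidden, n_emotions)

    def forward(self, segments):
        """segments: (batch, n_segments, 128) precomputed VGGish embeddings."""
        h, _ = self.gru(segments)                 # (B, T, 2H)
        w = torch.softmax(self.attn(h), dim=1)    # attention over segments
        pooled = (w * h).sum(dim=1)               # attention-weighted sum
        return self.out(pooled)                   # emotion logits
```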
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
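The run-time control in the entry above can be illustrated with a small sketch: an emotion attribute vector whose normalized weights mix learned per-emotion embeddings. The emotion inventory and function names are hypothetical, not taken from the paper.

```python
# Sketch: mix learned emotion embeddings with a manually defined
# attribute vector; names and the emotion set are illustrative.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]  # assumed set

def mix_emotion_embedding(embeddings, attribute):
    """embeddings: dict emotion -> (dim,) learned embedding.
    attribute: dict emotion -> weight; normalized to sum to 1."""
    w = np.array([attribute.get(e, 0.0) for e in EMOTIONS], dtype=float)
    w = w / w.sum()
    return sum(wi * embeddings[e]
               for wi, e in zip(w, EMOTIONS) if wi > 0)

# e.g., a 70/30 happy-surprise mixture:
# emb = mix_emotion_embedding(embs, {"happy": 0.7, "surprise": 0.3})
```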
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
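One simple way to realize the intensity control described in the entry above is to interpolate in the continuous style space between a neutral prototype and an emotion prototype; this is a hedged sketch, not the paper's actual mechanism.

```python
# Hedged sketch: intensity as linear interpolation between style
# prototypes; the paper's disentanglement model is not reproduced here.
import numpy as np

def intensity_controlled_style(neutral_proto, emotion_proto, intensity):
    """Blend from the neutral prototype (intensity 0.0) toward the
    emotion prototype (intensity 1.0) in the continuous style space."""
    t = float(np.clip(intensity, 0.0, 1.0))
    return (1.0 - t) * neutral_proto + t * emotion_proto

# style = intensity_controlled_style(neutral, angry, 0.5)  # "mildly angry"
```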
- Detecting Emotion Carriers by Combining Acoustic and Lexical Representations [7.225325393598648]
We focus on Emotion Carriers (ECs), defined as the segments that best explain the emotional state of the narrator.
ECs can provide a richer representation of the user state to improve natural language understanding.
We leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives.
arXiv Detail & Related papers (2021-12-13T12:39:53Z)
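The early and late fusion strategies in the entry above can be sketched as follows, with assumed feature sizes (88-dimensional eGeMAPS-style acoustic features, 768-dimensional BERT-style text embeddings) and a binary EC/non-EC decision per word; the classifiers are deliberately simple placeholders.

```python
# Sketch of word-level early vs. late fusion for Emotion Carrier detection;
# dimensions and classifiers are illustrative assumptions.
import torch
import torch.nn as nn

class ECFusion(nn.Module):
    def __init__(self, acoustic_dim=88, text_dim=768, mode="early"):
        super().__init__()
        self.mode = mode
        if mode == "early":   # concatenate features, single classifier
            self.clf = nn.Linear(acoustic_dim + text_dim, 2)
        else:                 # "late": per-modality classifiers, averaged
            self.clf_a = nn.Linear(acoustic_dim, 2)
            self.clf_t = nn.Linear(text_dim, 2)

    def forward(self, acoustic, text):
        """acoustic: (n_words, 88); text: (n_words, 768); returns EC logits."""
        if self.mode == "early":
            return self.clf(torch.cat([acoustic, text], dim=-1))
        return (self.clf_a(acoustic) + self.clf_t(text)) / 2
```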
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN)
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)