Related papers: Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

URL: http://arxiv.org/abs/2308.14359v3
Date: Wed, 27 Sep 2023 18:00:54 GMT
Title: Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks
Authors: Payal Mohapatra, Akash Pandey, Yueyuan Sui, Qi Zhu
Abstract summary: We look at speech emotion understanding as a perception task which is a more realistic setting. We leverage ComParE rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides 4.6% improvement over the reported baseline.
Score: 3.570593982494095
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human emotion understanding is pivotal in making conversational technology mainstream. We view speech emotion understanding as a perception task which is a more realistic setting. With varying contexts (languages, demographics, etc.) different share of people perceive the same speech segment as a non-unanimous emotion. As part of the ACM Multimedia 2023 Computational Paralinguistics ChallengE (ComParE) in the EMotion Share track, we leverage their rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion. We demonstrate that the training scheme of different foundation models dictates their effectiveness for tasks beyond speech recognition, especially for non-semantic speech tasks like emotion understanding. This is a very complex task due to multilingual speakers, variability in the target labels, and inherent imbalance in the regression dataset. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides 4.6% improvement over the reported baseline.

Related papers

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech [0.13048920509133805]
We evaluate four spoken language models (SLMs) on the task of speech emotion recognition.<n>Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task.
arXiv Detail & Related papers (2025-10-29T00:45:36Z)
Switchboard-Affect: Emotion Perception Labels from Conversational Speech [7.576840738395629]
We identify the Switchboard corpus as a promising source of naturalistic conversational speech.<n>We train a crowd to label the dataset for categorical emotions and dimensional attributes.<n>We evaluate state-of-the-art SER models and find variable performance across the emotion categories with especially poor generalization.
arXiv Detail & Related papers (2025-10-14T21:23:04Z)
Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition [58.74986434825755]
Cross-lingual speech emotion recognition is a challenging task due to differences in phonetic variability and speaker-specific expressive styles.<n>We propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels.<n>Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits.
arXiv Detail & Related papers (2025-09-19T21:03:21Z)
Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs [37.62433475609052]
We develop a strategy to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations.<n>We introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training.<n> Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
arXiv Detail & Related papers (2025-06-07T14:52:58Z)
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection.<n>VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z)
Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages [31.15696076055884]
We propose leveraging contrastive learning to refine multilingual speech features and extend large language models. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER.
arXiv Detail & Related papers (2025-03-25T05:58:18Z)
Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech according to a given emotion while preserving non-emotion components. We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity. Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition. Our best system, a multi-modal speech, and text-based model, provides a performance of UA(Unweighed Accuracy) + WA (Weighed Accuracy) of 157.57 compared to the baseline algorithm performance of 119.66.
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages. Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues. It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z)
Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations [4.297070083645049]
This paper presents a multi-scale conversational context learning approach for speech emotion recognition. We investigated this approach on both speech transcriptions and acoustic segments. According to our tests, the context derived from previous tokens has a more significant influence on accurate prediction than the following tokens.
arXiv Detail & Related papers (2023-08-28T20:31:45Z)
AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations. We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space. We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
deep learning of segment-level feature representation for speech emotion recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions. First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances. Second, an attentive bi-directional recurrent unit (GRU) models contextual-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation. Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding. In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.