Related papers: Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

URL: http://arxiv.org/abs/2509.20373v1
Date: Fri, 19 Sep 2025 21:03:21 GMT
Title: Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition
Authors: Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee
Abstract summary: Cross-lingual speech emotion recognition is a challenging task due to differences in phonetic variability and speaker-specific expressive styles. We propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits.
Score: 58.74986434825755
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
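
The abstract does not say which graph-based clustering algorithm is used, so the following is only a minimal sketch of the general idea of grouping speakers into communities, not the authors' method. It assumes each speaker is represented by a fixed-length embedding, links two speakers when their cosine similarity reaches a threshold, and treats each connected component of the resulting graph as a community. The function name `speaker_communities`, the `threshold` parameter, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def speaker_communities(embeddings, threshold=0.8):
    """Group speakers connected by cosine similarity >= threshold.

    `embeddings` maps a speaker id to its embedding vector (hypothetical
    input). Communities are the connected components of the similarity
    graph, found with a simple union-find structure.
    """
    speakers = list(embeddings)
    parent = {s: s for s in speakers}

    def find(s):
        # Walk to the root, shortening the path along the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Add an edge (merge two components) for every sufficiently similar pair.
    for i, a in enumerate(speakers):
        for b in speakers[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for s in speakers:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Toy embeddings (made up for illustration): two speakers point mostly
# along the first axis, two mostly along the second.
toy = {
    "spk_a": [1.0, 0.1],
    "spk_b": [0.9, 0.2],
    "spk_c": [0.1, 1.0],
    "spk_d": [0.0, 0.9],
}
communities = speaker_communities(toy)
print(communities)
```

To make the communities emotion-specific in the sense the abstract describes, one option under these assumptions would be to run the same grouping once per emotion class, using embeddings computed only from utterances carrying that emotion.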

Related papers

Marco-Voice Technical Report [35.01600797874603]
The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset.
arXiv Detail & Related papers (2025-08-04T04:08:22Z)
Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages [31.15696076055884]
We propose leveraging contrastive learning to refine multilingual speech features and extend large language models. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER.
arXiv Detail & Related papers (2025-03-25T05:58:18Z)
Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components. We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks [3.570593982494095]
We treat speech emotion understanding as a perception task, which is a more realistic setting. We leverage the rich ComParE dataset of multilingual speakers, with a multi-label regression target of 'emotion share', i.e. the perception of that emotion. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides a 4.6% improvement over the reported baseline.
arXiv Detail & Related papers (2023-08-28T07:11:27Z)
AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations. We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space. We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition [48.29355616574199]
We analyze the transferability of emotion recognition across three different languages--English, Mandarin Chinese, and Cantonese. This study concludes that different language and age groups require specific speech features, thus making cross-lingual inference an unsuitable method.
arXiv Detail & Related papers (2023-06-26T08:48:08Z)
Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN). We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.