Related papers: EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification

EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification

URL: http://arxiv.org/abs/2505.19693v2
Date: Fri, 17 Oct 2025 01:34:55 GMT
Title: EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification
Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee,
Abstract summary: We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression.<n>In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to.
Score: 49.128847336227636
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework.

Related papers

Reinforcement Learning Enhancement Using Vector Semantic Representation and Symbolic Reasoning for Human-Centered Autonomous Emergency Braking [4.3152045411139675]
This paper proposes a novel pipeline that produces a neuro-symbolic feature representation that encompasses semantic, spatial, and shape information.<n>It also proposes a Soft First-Order Logic (SFOL) reward function that balances human values via a symbolic reasoning module.<n>The findings demonstrate that integrating holistic representations and soft reasoning into Reinforcement Learning can support more context-aware and value-aligned decision-making for autonomous driving.
arXiv Detail & Related papers (2026-02-04T21:56:27Z)
Local-Global Feature Fusion for Subject-Independent EEG Emotion Recognition [5.248014945077116]
We propose a fusion framework that integrates (i) local, channel-wise descriptors and (ii) global, trial-level descriptors.<n>Experiments under a leave-one-subject-out protocol demonstrate that the proposed method consistently outperforms single-view and classical baselines.
arXiv Detail & Related papers (2026-01-13T00:27:01Z)
Decoding Visual Neural Representations by Multimodal with Dynamic Balancing [8.355081324607537]
We propose an innovative framework that integrates EEG, image, and text data, aiming to decode visual neural representations from low signal-to-noise ratio EEG signals.<n>We introduce text modality to enhance the semantic correspondence between EEG signals and visual content.<n>Our method surpasses previous state-of-the-art methods in both Top-1 and Top-5 accuracy metrics, improving by 2.0% and 4.7% respectively.
arXiv Detail & Related papers (2025-09-03T16:03:59Z)
Domain Generalization using Action Sequences for Egocentric Action Recognition [22.373604443667134]
Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment.<n>We propose a domain generalization approach for Egocentric Action Recognition.<n>By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments.
arXiv Detail & Related papers (2025-06-21T11:33:08Z)
CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations [52.251569042852815]
CRIA is an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets.<n>The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively.<n> Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods with the same pre-training conditions.
arXiv Detail & Related papers (2025-06-19T06:31:08Z)
Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model [0.8747606955991707]
We propose a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment.<n>SegVLM shows strong generalization across diverse datasets and referring expression scenarios.
arXiv Detail & Related papers (2025-05-25T17:42:53Z)
Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy [8.527959937101826]
We train a neural network to produce the variational posterior of a collection of Bernoulli random variables. We modify the prosodic features of a masked segment to increase the score of target emotion. Our experiments demonstrate that this framework changes the perceived emotion of a given speech utterance to the target.
arXiv Detail & Related papers (2024-08-04T00:47:29Z)
KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter [49.85369344101118]
We introduce KFD-NeRF, a novel dynamic neural radiance field integrated with an efficient and high-quality motion reconstruction framework based on Kalman filtering. Our key idea is to model the dynamic radiance field as a dynamic system whose temporally varying states are estimated based on two sources of knowledge: observations and predictions. Our KFD-NeRF demonstrates similar or even superior performance within comparable computational time and state-of-the-art view synthesis performance with thorough training.
arXiv Detail & Related papers (2024-07-18T05:48:24Z)
Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework. Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain. We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features. DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one.
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
IDA: Informed Domain Adaptive Semantic Segmentation [51.12107564372869]
We propose an Domain Informed Adaptation (IDA) model, a self-training framework that mixes the data based on class-level segmentation performance. In our IDA model, the class-level performance is tracked by an expected confidence score (ECS) and we then use a dynamic schedule to determine the mixing ratio for data in different domains. Our proposed method is able to outperform the state-of-the-art UDA-SS method by a margin of 1.1 mIoU in the adaptation of GTA-V to Cityscapes and of 0.9 mIoU in the adaptation of SYNTHIA to City
arXiv Detail & Related papers (2023-03-05T18:16:34Z)
Incorporating Dynamic Semantics into Pre-Trained Language Model for Aspect-based Sentiment Analysis [67.41078214475341]
We propose Dynamic Re-weighting BERT (DR-BERT) to learn dynamic aspect-oriented semantics for ABSA. Specifically, we first take the Stack-BERT layers as a primary encoder to grasp the overall semantic of the sentence. We then fine-tune it by incorporating a lightweight Dynamic Re-weighting Adapter (DRA)
arXiv Detail & Related papers (2022-03-30T14:48:46Z)
Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to under-stand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels. We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects. The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
Correlation based Multi-phasal models for improved imagined speech EEG recognition [22.196642357767338]
This work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units. A bi-phase common representation learning module using neural networks is designed to model the correlation and between an analysis phase and a support phase. The proposed approach further handles the non-availability of multi-phasal data during decoding.
arXiv Detail & Related papers (2020-11-04T09:39:53Z)
Facial Action Unit Intensity Estimation via Semantic Correspondence Learning with Dynamic Graph Convolution [27.48620879003556]
We present a new learning framework that automatically learns the latent relationships of AUs via establishing semantic correspondences between feature maps. In the heatmap regression-based network, feature maps preserve rich semantic information associated with AU intensities and locations. This motivates us to model the correlation among feature channels, which implicitly represents the co-occurrence relationship of AU intensity levels.
arXiv Detail & Related papers (2020-04-20T23:55:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.