Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning
- URL: http://arxiv.org/abs/2505.16220v1
- Date: Thu, 22 May 2025 04:44:20 GMT
- Title: Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning
- Authors: Liang-Yeh Shen, Shi-Xin Fang, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
- Abstract summary: This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER). By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios.
- Score: 45.925209699021124
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
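As a rough illustration of the adaptation loop the abstract describes, the sketch below implements MAML-style personalization with per-layer, per-step learnable inner-loop learning rates (in the spirit of MAML++) on top of frozen self-supervised features. It is a minimal sketch under stated assumptions, not the authors' released code: `SERHead`, the feature dimension, the step count, and the random task data are all placeholders, and derivative annealing is only noted in a comment.

```python
# Minimal MAML-style listener personalization sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SERHead(nn.Module):
    """Small head on top of frozen self-supervised speech features."""
    def __init__(self, feat_dim=768, hidden=256, n_emotions=4):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_emotions)

    def forward(self, x, params=None):
        # Functional forward pass so we can run with adapted parameters.
        p = params if params is not None else dict(self.named_parameters())
        h = F.relu(F.linear(x, p["fc1.weight"], p["fc1.bias"]))
        return F.linear(h, p["fc2.weight"], p["fc2.bias"])

def inner_adapt(model, lrs, support_x, support_y, n_steps):
    """Adapt to one listener's few labeled clips (the support set)."""
    params = dict(model.named_parameters())
    for step in range(n_steps):
        loss = F.cross_entropy(model(support_x, params), support_y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        # Per-layer, per-step learning rates: lrs[name][step] is learnable.
        params = {name: p - lrs[name][step] * g
                  for (name, p), g in zip(params.items(), grads)}
    return params

# Meta-training skeleton: each task is one annotator's labels, split into a
# support set (adaptation) and a query set (meta-loss).
feat_dim, n_steps = 768, 3
model = SERHead(feat_dim)
lrs = {name: nn.Parameter(0.01 * torch.ones(n_steps))
       for name, _ in model.named_parameters()}
meta_opt = torch.optim.Adam(list(model.parameters()) + list(lrs.values()), lr=1e-3)

for _ in range(100):  # sampled annotator tasks; random data stands in here
    support_x, support_y = torch.randn(8, feat_dim), torch.randint(0, 4, (8,))
    query_x, query_y = torch.randn(16, feat_dim), torch.randint(0, 4, (16,))
    adapted = inner_adapt(model, lrs, support_x, support_y, n_steps)
    meta_loss = F.cross_entropy(model(query_x, adapted), query_y)
    meta_opt.zero_grad()
    meta_loss.backward()  # second-order; derivative annealing (gradually
    meta_opt.step()       # enabling higher-order terms) is not shown
```

At test time, the same `inner_adapt` routine would be run on a new listener's handful of labeled clips before prediction.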
Related papers
- Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation [58.189703277322224]
Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion. Emotion and content information existing in the reference and source inputs can provide direct and accurate supervision signals for SPFEM models. We propose to learn content and emotion priors as guidance, augmented with contrastive learning, to obtain decoupled content and emotion representations.
arXiv Detail & Related papers (2025-04-08T04:34:38Z)
- BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation [29.514459004019024]
We propose a behavior-aware MLLM-based framework (BeMERC) to incorporate speakers' behaviors into a vanilla MLLM-based MERC model. BeMERC outperforms state-of-the-art methods on two benchmark datasets.
arXiv Detail & Related papers (2025-03-31T12:04:53Z)
- MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition [7.81011775615268]
We introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER.
Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes.
Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet achieves superior performance compared to state-of-the-art SER approaches.
arXiv Detail & Related papers (2023-08-08T03:43:24Z)
- Contrastive Meta-Learning for Partially Observable Few-Shot Learning [5.363168481735953]
We consider the problem of learning a unified representation from partial observations, where useful features may be present in only some of the views.
We approach this through a probabilistic formalism enabling views to map to representations with different levels of uncertainty in different components.
Our approach, Partial Observation Experts Modelling (POEM), then enables us to meta-learn consistent representations from partial observations.
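One standard way to realize per-component uncertainty, sketched below, is to encode each view as a diagonal Gaussian over the representation and fuse views with a precision-weighted product, so components a view is unsure about contribute little. This illustrates the stated formalism in miniature and is not the POEM reference implementation.

```python
# Precision-weighted fusion of per-view Gaussian embeddings (illustrative).
import torch

def fuse_views(means, log_vars):
    """means, log_vars: (n_views, dim) per-component Gaussian parameters."""
    precisions = torch.exp(-log_vars)            # 1 / sigma^2 per component
    fused_precision = precisions.sum(dim=0)
    fused_mean = (precisions * means).sum(dim=0) / fused_precision
    return fused_mean, fused_precision

# Two partial views: view 0 is confident about the first half of the
# representation, view 1 about the second half.
means = torch.stack([torch.ones(8), -torch.ones(8)])
log_vars = torch.stack([
    torch.cat([torch.full((4,), -4.0), torch.full((4,), 4.0)]),
    torch.cat([torch.full((4,), 4.0), torch.full((4,), -4.0)]),
])
fused_mean, _ = fuse_views(means, log_vars)
print(fused_mean)  # ~ +1 in the first half, ~ -1 in the second half
```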
arXiv Detail & Related papers (2023-01-30T18:17:24Z)
- Rethinking the Learning Paradigm for Facial Expression Recognition [56.050738381526116]
We rethink the existing training paradigm and argue that it is better to use weakly supervised strategies to train FER models with the original ambiguous annotations.
arXiv Detail & Related papers (2022-09-30T12:00:54Z)
- Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition [11.760166084942908]
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER).
We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks.
We generate targets for sentiment classification using a text-to-sentiment model trained on publicly available data.
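The joint objective can be pictured with a short sketch: one shared encoder feeds both a CTC-based ASR head and a pooled sentiment head, and the two losses are summed. Everything here (the GRU encoder, the 0.5 task weight, the random stand-in data) is an illustrative assumption rather than the paper's configuration.

```python
# Multi-task ASR + sentiment pre-training sketch (illustrative only).
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=32, n_sentiments=3):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)          # per-frame token logits
        self.sent_head = nn.Linear(hidden, n_sentiments)  # utterance-level logits

    def forward(self, x):
        h, _ = self.encoder(x)                       # (B, T, hidden)
        asr_logits = self.asr_head(h)                # (B, T, vocab) for CTC
        sent_logits = self.sent_head(h.mean(dim=1))  # pooled for sentiment
        return asr_logits, sent_logits

model = MultiTaskSER()
ctc = nn.CTCLoss(blank=0)
xent = nn.CrossEntropyLoss()

x = torch.randn(4, 120, 80)                    # batch of feature sequences
tokens = torch.randint(1, 32, (4, 20))         # transcript targets
token_lens = torch.full((4,), 20, dtype=torch.long)
frame_lens = torch.full((4,), 120, dtype=torch.long)
# Sentiment targets would come from a text-to-sentiment model applied to the
# transcripts, as the summary describes; random labels stand in here.
sentiments = torch.randint(0, 3, (4,))

asr_logits, sent_logits = model(x)
log_probs = asr_logits.log_softmax(-1).transpose(0, 1)  # (T, B, vocab) for CTCLoss
loss = ctc(log_probs, tokens, frame_lens, token_lens) + 0.5 * xent(sent_logits, sentiments)
loss.backward()
```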
arXiv Detail & Related papers (2022-01-27T22:20:28Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
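A text-only sketch of the prompt reformulation: the classifier is replaced by a masked-language model that scores emotion "verbalizer" words at a [MASK] position. MEmoBERT itself is multimodal and pre-trained differently; the template and verbalizer list below are illustrative assumptions.

```python
# Emotion classification as masked text prediction (illustrative only).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

verbalizers = ["happy", "sad", "angry", "neutral"]
verbalizer_ids = tokenizer.convert_tokens_to_ids(verbalizers)

utterance = "i can't believe we finally won the game"
prompt = f"{utterance} . the speaker feels {tokenizer.mask_token} ."
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits           # (1, seq_len, vocab)
scores = logits[0, mask_pos, verbalizer_ids]  # restrict to emotion words
print(verbalizers[scores.argmax().item()])
```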
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation [56.830395467247016]
We propose a model of semantic memory for WSD in a meta-learning setting.
Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork.
We show that our model advances the state of the art in few-shot WSD and supports effective learning in extremely data-scarce scenarios.
arXiv Detail & Related papers (2021-06-05T20:40:01Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
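The attribute-selection idea can be pictured as a latent code partitioned into attribute-sensitive blocks, with each downstream task keeping only its block. The sketch below is a hypothetical illustration (block sizes, encoder, and names are assumptions), not LR-VAE itself.

```python
# Attribute-factorized representation with block selection (illustrative).
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    def __init__(self, feat_dim=80, emo_dim=16, spk_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, emo_dim + spk_dim))
        self.emo_dim = emo_dim

    def forward(self, x):
        z = self.net(x)
        # Attribute selection: slice out the block a task should see.
        return z[..., :self.emo_dim], z[..., self.emo_dim:]

enc = FactorizedEncoder()
z_emo, z_spk = enc(torch.randn(4, 80))
# Identity-free SER would train its classifier on z_emo only, and an
# emotionless speaker-verification system on z_spk only.
print(z_emo.shape, z_spk.shape)  # torch.Size([4, 16]) torch.Size([4, 16])
```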
arXiv Detail & Related papers (2021-06-05T06:19:14Z)