Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning
- URL: http://arxiv.org/abs/2412.19200v1
- Date: Thu, 26 Dec 2024 12:47:35 GMT
- Title: Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning
- Authors: Dengming Zhang, Weitao You, Ziheng Liu, Lingyun Sun, Pei Chen
- Abstract summary: We propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method for Dynamic Music Emotion Recognition (DMER). Our method fuses features from a dual-scale feature extractor and captures both short- and long-term dependencies. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.
- Score: 15.506299212817034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic Music Emotion Recognition (DMER) aims to predict the emotion at different moments in music, playing a crucial role in music information retrieval. Existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short- and long-term dependencies using a dual-scale attention transformer, improving performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotator. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method achieves state-of-the-art performance in both traditional DMER and PDMER.
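The two mechanisms this abstract hinges on, grouping annotations by annotator into meta-learning tasks and adapting from a single personalized sample, can be sketched in a few lines. The following is an illustration only, not the authors' code: the record format, the `k_support`/`k_query` split sizes, and the plain first-order SGD inner step are all assumptions.

```python
import random
from collections import defaultdict

import torch


def build_tasks_by_annotator(records, k_support=1, k_query=8):
    """Group (features, label_curve, annotator_id) records by annotator so
    every meta-learning task reflects one person's emotion perception,
    mirroring the task construction strategy described in the abstract."""
    by_annotator = defaultdict(list)
    for features, labels, annotator_id in records:
        by_annotator[annotator_id].append((features, labels))

    tasks = []
    for samples in by_annotator.values():
        if len(samples) < k_support + k_query:
            continue  # too few clips from this annotator to form a task
        random.shuffle(samples)
        tasks.append((samples[:k_support],                      # support set
                      samples[k_support:k_support + k_query]))  # query set
    return tasks


def adapt_one_shot(model, loss_fn, support_sample, inner_lr=1e-2):
    """Take one gradient step on the single personalized sample (the
    abstract's one-shot personalization). A first-order sketch; methods
    like MAML also differentiate through this step during meta-training."""
    x, y = support_sample
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= inner_lr * p.grad
```

At test time, one annotated excerpt from a new listener would serve as the support set, and the adapted model would then predict that listener's dynamic emotion curves on unseen clips.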
Related papers
- Memory-guided Prototypical Co-occurrence Learning for Mixed Emotion Recognition [56.00118641432005]
We propose a Memory-guided Prototypical Co-occurrence Learning framework that explicitly models emotion co-occurrence patterns. Inspired by human cognitive memory systems, we introduce a memory retrieval strategy to extract semantic-level co-occurrence associations. Our model learns affectively informative representations for accurate emotion distribution prediction.
arXiv Detail & Related papers (2026-02-24T04:11:25Z)
- TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition [31.4260327895046]
Multimodal Emotion Recognition aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts. We propose Typicality-based Consistency-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception.
arXiv Detail & Related papers (2025-11-19T03:49:22Z)
- Emotion and Intention Guided Multi-Modal Learning for Sticker Response Selection [35.78392011537934]
The Sticker Response Selection (SRS) task aims to select the most contextually appropriate sticker based on the dialogue. Existing methods typically rely on semantic matching and model emotional and intentional cues separately. EIGML (Emotion and Intention Guided Multi-Modal Learning) is the first to jointly model emotion and intention, effectively reducing the bias caused by isolated modeling.
arXiv Detail & Related papers (2025-11-16T16:11:48Z)
- A Study on the Data Distribution Gap in Music Emotion Recognition [7.281487567929003]
Music Emotion Recognition (MER) is a task deeply connected to human perception. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres. We address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations.
arXiv Detail & Related papers (2025-10-06T10:57:05Z)
- Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework that disentangles identity from emotion and cooperates emotions with similar characteristics.
First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.
Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.
Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
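The DICE-Talk entry above attributes its emotion embedder to cross-modal attention over audio-visual cues. As a generic sketch only, with the dimensions, head count, and residual layout all assumed rather than taken from the paper:

```python
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Let audio features attend over visual features (or vice versa).
    Illustrative layer sizes; not DICE-Talk's actual embedder."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (B, T_a, dim) queries; visual: (B, T_v, dim) keys/values
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual connection
```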
- M2SE: A Multistage Multitask Instruction Tuning Strategy for Unified Sentiment and Emotion Analysis [5.3848462080869215]
We propose M2SE, a Multistage Multitask Sentiment and Emotion Instruction Tuning Strategy for general-purpose MLLMs. It employs a combined approach to train models on tasks such as multimodal sentiment analysis, emotion recognition, facial expression recognition, emotion reason inference, and emotion cause-pair extraction. Our model, Emotion Universe (EmoVerse), is built on a basic MLLM framework without modifications, yet it achieves substantial improvements across these tasks when trained with the M2SE strategy.
arXiv Detail & Related papers (2024-12-11T02:55:00Z)
- Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation [19.139752434303688]
Managing the emotional aspect remains a challenge in automatic music generation.
This paper explores the disentanglement of emotions in piano performance generation through a two-stage framework.
arXiv Detail & Related papers (2024-07-30T16:29:28Z)
- Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)
- Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss [80.79641247882012]
We focus on unsupervised feature learning for Multimodal Emotion Recognition (MER).
We consider discrete emotions and use text, audio, and vision as modalities.
Our method, based on a contrastive loss between pairwise modalities, is the first such attempt in the MER literature.
arXiv Detail & Related papers (2022-07-23T10:11:24Z)
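A minimal sketch of a contrastive objective between a pair of modalities, in the spirit of the entry above; the symmetric InfoNCE form and the temperature value are generic assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F


def pairwise_contrastive_loss(z_a, z_b, temperature=0.07):
    """InfoNCE-style loss between two modalities' embeddings of the same
    batch (e.g. text vs. audio): matched rows are positives, all other
    rows are negatives.

    z_a, z_b: (B, D) embeddings from the two modalities.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrize so each modality learns to retrieve its counterpart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```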
- Enhancing Affective Representations of Music-Induced EEG through Multimodal Supervision and Latent Domain Adaptation [34.726185927120355]
We employ music signals as a supervisory modality to EEG, aiming to project their semantic correspondence onto a common representation space.
We utilize a bi-modal framework by combining an LSTM-based attention model to process EEG and a pre-trained model for music tagging, along with a reverse domain discriminator to align the distributions of the two modalities.
The resulting framework can be utilized for emotion recognition both directly, by performing supervised predictions from either modality, and indirectly, by providing relevant music samples to EEG input queries.
arXiv Detail & Related papers (2022-02-20T07:32:12Z)
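The "reverse domain discriminator" mentioned above is commonly realized with a gradient reversal layer, as in domain-adversarial training (Ganin & Lempitsky). A minimal sketch, with the feature width and discriminator architecture assumed:

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward,
    so the encoder is pushed toward modality-invariant features while the
    discriminator tries to tell EEG embeddings from music embeddings."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


# Illustrative discriminator over an assumed 128-d shared embedding space.
domain_discriminator = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
```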
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
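The prompt-based reformulation described above, turning emotion classification into masked-token prediction, can be illustrated with a text-only BERT masked LM. MEmoBERT itself is multimodal; the checkpoint, prompt template, and label words below are illustrative assumptions:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# "Classify the emotion" becomes "fill in the masked word".
prompt = f"I am so happy today! The emotion is {tokenizer.mask_token}."
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

label_words = ["happy", "sad", "angry", "neutral"]  # assumed verbalizer set
label_ids = tokenizer.convert_tokens_to_ids(label_words)
print(label_words[logits[label_ids].argmax().item()])
```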
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
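A schematic sketch of the scene-based attention idea from the entry above, where the scene feature scores each detected object before pooling; the sizes and the concatenation-based scoring are assumptions, not SOLVER's published architecture:

```python
import torch
import torch.nn as nn


class SceneGuidedFusion(nn.Module):
    """Weight each object feature by its affinity to the scene feature,
    then pool. Dimensions are illustrative."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, scene, objects):
        # scene: (B, dim); objects: (B, N, dim)
        n = objects.size(1)
        scene_exp = scene.unsqueeze(1).expand(-1, n, -1)
        affinity = self.score(torch.cat([scene_exp, objects], dim=-1))
        weights = torch.softmax(affinity.squeeze(-1), dim=1)  # (B, N)
        pooled = (weights.unsqueeze(-1) * objects).sum(dim=1)
        return torch.cat([scene, pooled], dim=-1)  # fused (B, 2*dim)
```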
- Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction.
To the best of our knowledge, this is the first time a stimuli selection process has been introduced into VEA in an end-to-end network.
Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z)
- Recognizing Emotions evoked by Movies using Multitask Learning [3.4290619267487488]
Methods for recognizing evoked emotions are usually trained on human-annotated data.
We propose two deep learning architectures: a Single-Task (ST) architecture and a Multi-Task (MT) architecture.
Our results show that the MT approach can more accurately model each viewer and the aggregated annotation when compared to methods that are directly trained on the aggregated annotations.
arXiv Detail & Related papers (2021-07-30T10:21:40Z)
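The MT architecture above can be read as a shared encoder with one regression head per viewer plus a head for the aggregated annotation; a schematic sketch with assumed input and layer sizes (not the authors' exact model):

```python
import torch.nn as nn


class MultiTaskEmotionModel(nn.Module):
    """Shared encoder; one head per viewer and one for the aggregate."""
    def __init__(self, in_dim=1024, hidden=256, n_viewers=10, out_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.viewer_heads = nn.ModuleList(
            [nn.Linear(hidden, out_dim) for _ in range(n_viewers)])
        self.aggregate_head = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = self.encoder(x)
        viewer_preds = [head(h) for head in self.viewer_heads]
        return viewer_preds, self.aggregate_head(h)
```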
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)