Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition
- URL: http://arxiv.org/abs/2512.17946v1
- Date: Mon, 15 Dec 2025 03:27:35 GMT
- Title: Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition
- Authors: Haiying Xia, Zhongyi Huang, Yumei Tan, Shuxiang Song
- Abstract summary: Music emotion recognition is a key task in symbolic music understanding. Recent approaches have shown promising results by fine-tuning models to map musical semantics to emotional labels. We propose a Mode-Guided Enhancement (MoGE) strategy that incorporates psychological insights on mode into the model.
- Score: 11.051812953517521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Symbolic music emotion recognition (SMER) is a key task in symbolic music understanding. Recent approaches have shown promising results by fine-tuning large-scale pre-trained models (e.g., MIDIBERT, a benchmark in symbolic music understanding) to map musical semantics to emotional labels. While these models effectively capture distributional musical semantics, they often overlook tonal structures, particularly musical modes, which play a critical role in emotional perception according to music psychology. In this paper, we investigate the representational capacity of MIDIBERT and identify its limitations in capturing mode-emotion associations. To address this issue, we propose a Mode-Guided Enhancement (MoGE) strategy that incorporates psychological insights on mode into the model. Specifically, we first conduct a mode augmentation analysis, which reveals that MIDIBERT fails to effectively encode emotion-mode correlations. We then identify the least emotion-relevant layer within MIDIBERT and introduce a Mode-guided Feature-wise linear modulation injection (MoFi) framework to inject explicit mode features, thereby enhancing the model's capability in emotional representation and inference. Extensive experiments on the EMOPIA and VGMIDI datasets demonstrate that our mode injection strategy significantly improves SMER performance, achieving accuracies of 75.2% and 59.1%, respectively. These results validate the effectiveness of mode-guided modeling in symbolic music emotion recognition.
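The abstract describes MoFi as injecting explicit mode features into a chosen MIDIBERT layer via feature-wise linear modulation (FiLM). As a minimal sketch only, not the authors' implementation: the PyTorch snippet below shows the generic FiLM mechanism such an injection could build on, where a conditioning network maps a mode feature to per-channel scale and shift applied to the layer's hidden states. All names (ModeFiLM, mode_dim, to_gamma, to_beta) and shapes are hypothetical.

```python
# Hypothetical FiLM-style mode injection sketch (not the paper's MoFi code).
import torch
import torch.nn as nn

class ModeFiLM(nn.Module):
    """Feature-wise linear modulation conditioned on an explicit mode feature."""
    def __init__(self, mode_dim: int, hidden_dim: int):
        super().__init__()
        # Map the mode feature to per-channel scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(mode_dim, hidden_dim)
        self.to_beta = nn.Linear(mode_dim, hidden_dim)

    def forward(self, h: torch.Tensor, mode: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim); mode: (batch, mode_dim)
        gamma = self.to_gamma(mode).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.to_beta(mode).unsqueeze(1)
        return gamma * h + beta                   # FiLM: gamma * h + beta

# Usage: wrap the hidden states of the layer selected for injection.
film = ModeFiLM(mode_dim=2, hidden_dim=768)   # e.g., one-hot major/minor; BERT-base width
h = torch.randn(4, 512, 768)                  # hidden states of the chosen layer
mode = torch.tensor([[1.0, 0.0]] * 4)         # major mode, one-hot
h_injected = film(h, mode)
```

One appeal of FiLM here is that it is a light-touch conditioning mechanism: with gamma near one and beta near zero the backbone is left essentially unchanged, so the injection can be trained without disrupting the pre-trained representation.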
Related papers
- Music Flamingo: Scaling Music Understanding in Audio Language Models [98.94537017112704]
Music Flamingo is a novel large audio-language model designed to advance music understanding in foundational audio models. MF-Skills is a dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards.
arXiv Detail & Related papers (2025-11-13T13:21:09Z)
- Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation [26.273309051211204]
Video-to-music (V2M) generation aims to create music that aligns with visual content. We propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF).
arXiv Detail & Related papers (2025-11-12T08:02:06Z)
- SyMuPe: Affective and Controllable Symbolic Music Performance [0.00746020873338928]
We present SyMuPe, a novel framework for developing and training affective and controllable piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. For emotion control, we present and analyze samples generated under different text conditioning scenarios.
arXiv Detail & Related papers (2025-11-05T12:42:08Z)
- Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music [47.95375326361059]
We introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. We conduct extensive experiments on unconditional and text-conditioned generation tasks.
arXiv Detail & Related papers (2025-08-28T11:15:44Z)
- EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module. EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z)
- Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries [1.1743167854433303]
EMSYNC is a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for both music-theory-aware participants and general listeners.
arXiv Detail & Related papers (2025-02-14T13:32:59Z)
- Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation [19.139752434303688]
Managing the emotional aspect remains a challenge in automatic music generation.
This paper explores the disentanglement of emotions in piano performance generation through a two-stage framework.
arXiv Detail & Related papers (2024-07-30T16:29:28Z)
- A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition [76.65908232134203]
Symbolic Music Emotion Recognition (SMER) is the task of predicting music emotion from symbolic data, such as MIDI and MusicXML.
In this paper, we present a simple multi-task framework for SMER, which incorporates the emotion recognition task with other emotion-related auxiliary tasks.
arXiv Detail & Related papers (2022-01-15T07:45:10Z)
- Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities [6.832341432995627]
Music emotion recognition is an important task in MIR (Music Information Retrieval) research.
One important step towards better models would be to understand what a model is actually learning from the data.
We show how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction.
arXiv Detail & Related papers (2021-06-14T22:49:19Z)
- Enhancing Cognitive Models of Emotions with Representation Learning [58.2386408470585]
We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions.
Our framework integrates a contextualized embedding encoder with a multi-head probing model.
Our model is evaluated on the Empathetic Dialogue dataset and achieves state-of-the-art results in classifying 32 emotions.
arXiv Detail & Related papers (2021-04-20T16:55:15Z)
- Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)