MusER: Musical Element-Based Regularization for Generating Symbolic
Music with Emotion
- URL: http://arxiv.org/abs/2312.10307v2
- Date: Tue, 2 Jan 2024 02:36:22 GMT
- Authors: Shulei Ji and Xinyu Yang
- Abstract summary: We present a novel approach employing musical element-based regularization in the latent space to disentangle distinct elements.
By visualizing the latent space, we conclude that MusER yields a disentangled and interpretable latent space.
Experimental results demonstrate that MusER outperforms the state-of-the-art models for generating emotional music.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating music with emotion is an important task in automatic music
generation, in which emotion is evoked through a variety of musical elements
(such as pitch and duration) that change over time and collaborate with each
other. However, prior research on deep learning-based emotional music
generation has rarely explored the contribution of different musical elements
to emotions, let alone the deliberate manipulation of these elements to alter
the emotion of music, which is not conducive to fine-grained element-level
control over emotions. To address this gap, we present a novel approach
employing musical element-based regularization in the latent space to
disentangle distinct elements, investigate their roles in distinguishing
emotions, and further manipulate elements to alter musical emotions.
Specifically, we propose a novel VQ-VAE-based model named MusER. MusER
incorporates a regularization loss to enforce the correspondence between the
musical element sequences and the specific dimensions of latent variable
sequences, providing a new solution for disentangling discrete sequences.
Taking advantage of the disentangled latent vectors, a two-level decoding
strategy that includes multiple decoders attending to latent vectors with
different semantics is devised to better predict the elements. By visualizing
the latent space, we conclude that MusER yields a disentangled and interpretable
latent space and gain insights into the contribution of distinct elements to
the emotional dimensions (i.e., arousal and valence). Experimental results
demonstrate that MusER outperforms the state-of-the-art models for generating
emotional music in both objective and subjective evaluation. Besides, we
rearrange music through element transfer and attempt to alter the emotion of
music by transferring emotion-distinguishable elements.
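The paper does not spell out the regularization loss here. As a minimal, hypothetical sketch of the idea (the dimension assignments, MSE distance, and normalized element values are all assumptions of this illustration; the actual MusER loss operates on VQ-VAE latent sequences), pinning each musical element to a reserved latent dimension might look like:

```python
import numpy as np

def element_regularization_loss(latents, elements, dim_map):
    """Hypothetical element-based regularization: tie each musical
    element sequence (e.g. pitch, duration) to one reserved dimension
    of the latent vector sequence, so that elements end up
    disentangled along known axes of the latent space.

    latents  : (T, D) array, one latent vector per time step
    elements : dict, element name -> (T,) normalized value sequence
    dim_map  : dict, element name -> reserved latent dimension index
    """
    loss = 0.0
    for name, values in elements.items():
        d = dim_map[name]
        # penalize mismatch between the reserved dimension and the element
        loss += np.mean((latents[:, d] - values) ** 2)
    return loss / len(elements)

# toy check: 4 time steps, 8-dim latents, two elements
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
elems = {"pitch": np.array([0.1, 0.5, 0.9, 0.3]),
         "duration": np.array([0.2, 0.2, 0.6, 0.4])}
print(element_regularization_loss(z, elems, {"pitch": 0, "duration": 1}))
```

Minimizing such a term alongside the usual reconstruction and codebook losses would make dimension 0 track pitch and dimension 1 track duration, which is what lets the decoders (and later element transfer) address one element at a time.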
Related papers
- Emotion-Driven Melody Harmonization via Melodic Variation and Functional Representation
Emotion-driven melody harmonization aims to generate diverse harmonies for a single melody to convey desired emotions.
Previous research found it hard to alter the perceived emotional valence of lead sheets only by harmonizing the same melody with different chords.
In this paper, we propose a novel functional representation for symbolic music.
arXiv (2024-07-29)
- Emotion Manipulation Through Music -- A Deep Learning Interactive Visual Approach
We introduce a novel way to manipulate the emotional content of a song using AI tools.
Our goal is to achieve the desired emotion while leaving the original melody as intact as possible.
This research may contribute to on-demand custom music generation, the automated remixing of existing work, and music playlists tuned for emotional progression.
arXiv (2024-06-12)
- Are Words Enough? On the semantic conditioning of affective music generation
This scoping review aims to analyze and discuss the possibilities of music generation conditioned by emotions.
In detail, we review two main paradigms adopted in automatic music generation: rules-based and machine-learning models.
We conclude that overcoming the limitation and ambiguity of language to express emotions through music has the potential to impact the creative industries.
arXiv (2023-11-07)
- REMAST: Real-time Emotion-based Music Arrangement with Soft Transition
Music as an emotional intervention medium has important applications in scenarios such as music therapy, games, and movies.
We propose REMAST to achieve real-time emotion fit and smooth transition simultaneously.
According to the evaluation results, REMAST surpasses the state-of-the-art methods in objective and subjective metrics.
arXiv (2023-05-14)
- Speech Synthesis with Mixed Emotions
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv (2022-08-11)
- Contrastive Learning with Positive-Negative Frame Mask for Music Representation
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.
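PEMR's exact objective is not given here; as a generic illustration of a contrastive objective over positive and negative views of the same clip (an InfoNCE-style loss, which is an assumption of this sketch rather than PEMR's published formulation), the shape of such a loss is:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: pull the anchor embedding
    toward its positive view and push it away from negative views.
    All inputs are assumed to be L2-normalized vectors."""
    def sim(a, b):
        return (a @ b) / temperature
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives])
    # cross-entropy with the positive placed at index 0
    return -logits[0] + np.log(np.exp(logits).sum())

a = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])     # well-aligned positive -> loss near 0
neg = [np.array([0.0, 1.0])]   # orthogonal negative
print(info_nce(a, pos, neg))
```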
arXiv (2022-03-17)
- A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition
Symbolic Music Emotion Recognition (SMER) predicts music emotion from symbolic data, such as MIDI and MusicXML.
In this paper, we present a simple multi-task framework for SMER, which incorporates the emotion recognition task with other emotion-related auxiliary tasks.
arXiv (2022-01-15)
- Emotion Intensity and its Control for Emotional Voice Conversion
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv (2022-01-10)
- A Circular-Structured Representation for Visual Emotion Distribution Learning
We propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
arXiv (2021-06-23)
- Musical Prosody-Driven Emotion Classification: Interpreting Vocalists' Portrayal of Emotions Through Machine Learning
The role of musical prosody remains under-explored despite several studies demonstrating a strong connection between prosody and emotion.
In this study, we restrict the input of traditional machine learning algorithms to the features of musical prosody.
We utilize a methodology for individual data collection from vocalists, and personal ground truth labeling by the artist themselves.
arXiv (2021-06-04)
- Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space
Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger.
Existing emotion-based image and music matching methods either employ limited categorical emotion states or train the matching model using an impractical multi-stage pipeline.
In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space.
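That paper trains the matching end to end; as a toy illustration of the matching step only (nearest-neighbor retrieval by Euclidean distance in the VA plane is an assumption of this sketch, not the paper's learned model), matching images to music in continuous VA space looks like:

```python
import numpy as np

def match_by_va(image_va, music_va):
    """Match each image to the music clip whose predicted
    (valence, arousal) point is nearest in the continuous VA plane.
    image_va: (N, 2) array, music_va: (M, 2) array."""
    # pairwise Euclidean distances in VA space, shape (N, M)
    d = np.linalg.norm(image_va[:, None, :] - music_va[None, :, :], axis=-1)
    return d.argmin(axis=1)  # index of the closest clip per image

imgs = np.array([[0.8, 0.7],    # high valence/arousal image
                 [-0.5, -0.2]]) # low valence image
clips = np.array([[-0.6, -0.1], [0.7, 0.8]])
print(match_by_va(imgs, clips))  # → [1 0]
```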
arXiv (2020-08-22)
This list is automatically generated from the titles and abstracts of the papers in this site.