Emotion-Based End-to-End Matching Between Image and Music in
Valence-Arousal Space
- URL: http://arxiv.org/abs/2009.05103v1
- Date: Sat, 22 Aug 2020 20:12:23 GMT
- Title: Emotion-Based End-to-End Matching Between Image and Music in
Valence-Arousal Space
- Authors: Sicheng Zhao, Yaxian Li, Xingxu Yao, Weizhi Nie, Pengfei Xu, Jufeng
Yang, Kurt Keutzer
- Abstract summary: Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger.
Existing emotion-based image and music matching methods either employ limited categorical emotion states or train the matching model using an impractical multi-stage pipeline.
In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space.
- Score: 80.49156615923106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Both images and music can convey rich semantics and are widely used to induce
specific emotions. Matching images and music with similar emotions might help
to make emotion perceptions more vivid and stronger. Existing emotion-based
image and music matching methods either employ limited categorical emotion
states which cannot well reflect the complexity and subtlety of emotions, or
train the matching model using an impractical multi-stage pipeline. In this
paper, we study end-to-end matching between image and music based on emotions
in the continuous valence-arousal (VA) space. First, we construct a large-scale
dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K
image-music pairs. Second, we propose cross-modal deep continuous metric
learning (CDCML) to learn a shared latent embedding space which preserves the
cross-modal similarity relationship in the continuous matching space. Finally,
we refine the embedding space by further preserving the single-modal emotion
relationship in the VA spaces of both images and music. The metric learning in
the embedding space and task regression in the label space are jointly
optimized for both cross-modal matching and single-modal VA prediction. The
extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML
for emotion-based image and music matching as compared to the state-of-the-art
approaches.
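To make the joint objective concrete, here is a minimal PyTorch-style sketch (not the authors' released code): two modality encoders project pre-extracted image and music features into a shared embedding space, a VA head regresses valence-arousal from the same embedding, and the training loss combines a cross-modal metric term with single-modal VA regression. Encoder sizes, the similarity-to-distance mapping, and the equal loss weighting are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of joint cross-modal metric learning
# and valence-arousal (VA) regression, in the spirit of CDCML. Encoder shapes,
# the similarity-to-distance mapping, and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Projects pre-extracted image or music features into a shared embedding
    space and regresses a (valence, arousal) pair from the same embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                  nn.Linear(256, emb_dim))
        self.va_head = nn.Linear(emb_dim, 2)  # task regression in label space

    def forward(self, x):
        z = F.normalize(self.proj(x), dim=-1)
        return z, self.va_head(z)

def cdcml_losses(z_img, z_mus, va_img_pred, va_mus_pred,
                 va_img_gt, va_mus_gt, match_score):
    """match_score in [0, 1]: emotion similarity of an image-music pair,
    assumed here to be derived from the distance of their VA labels."""
    # Cross-modal metric term: embedding distance should track (1 - similarity).
    d = (z_img - z_mus).pow(2).sum(dim=-1)
    metric_loss = F.mse_loss(d, 1.0 - match_score)
    # Single-modal VA regression terms keep the embedding emotion-aware.
    reg_loss = F.mse_loss(va_img_pred, va_img_gt) + F.mse_loss(va_mus_pred, va_mus_gt)
    return metric_loss + reg_loss

# Toy usage with random features standing in for image/music descriptors.
img_enc, mus_enc = ModalityEncoder(in_dim=512), ModalityEncoder(in_dim=128)
x_img, x_mus = torch.randn(8, 512), torch.randn(8, 128)
va_img, va_mus = torch.rand(8, 2), torch.rand(8, 2)
score = 1.0 - (va_img - va_mus).abs().mean(dim=-1)   # assumed similarity measure
z_i, va_i = img_enc(x_img)
z_m, va_m = mus_enc(x_mus)
loss = cdcml_losses(z_i, z_m, va_i, va_m, va_img, va_mus, score)
loss.backward()
```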
Related papers
- Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings [10.302353984541497]
This research develops a model capable of generating music that resonates with the emotions depicted in visual arts.
Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music dataset.
Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data.
arXiv Detail & Related papers (2024-09-12T08:19:25Z)
- Joint Learning of Emotions in Music and Generalized Sounds [6.854732863866882]
We propose the use of multiple datasets as a multi-domain learning technique.
Our approach involves creating a common space encompassing features that characterize both generalized sounds and music.
We performed joint learning on the common feature space, leveraging heterogeneous model architectures.
arXiv Detail & Related papers (2024-08-04T12:19:03Z)
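A minimal sketch of the multi-domain idea in the entry above, under the assumption that music and generalized-sound features are mapped by domain-specific adapters into one common space feeding a shared valence-arousal head; all dimensions and architectures are placeholders rather than the paper's.

```python
# Illustrative sketch (my reading of the abstract, not the authors' code) of
# multi-domain joint learning: music and generalized-sound features are mapped
# into one common space and a shared head predicts valence-arousal for both.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmotionModel(nn.Module):
    def __init__(self, music_dim=512, sound_dim=128, common_dim=64):
        super().__init__()
        # Domain-specific adapters into the common feature space.
        self.music_adapter = nn.Linear(music_dim, common_dim)
        self.sound_adapter = nn.Linear(sound_dim, common_dim)
        # Shared emotion regressor operating on the common space.
        self.head = nn.Sequential(nn.Linear(common_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 2))  # (valence, arousal)

    def forward(self, x, domain):
        z = self.music_adapter(x) if domain == "music" else self.sound_adapter(x)
        return self.head(z)

model = JointEmotionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# One joint step over a music batch and a sound batch (random stand-in data).
for domain, dim in [("music", 512), ("sound", 128)]:
    x, y = torch.randn(16, dim), torch.rand(16, 2)
    F.mse_loss(model(x, domain), y).backward()
opt.step()
```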
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion [16.658813060879293]
We present a novel approach employing musical element-based regularization in the latent space to disentangle distinct elements.
By visualizing the latent space, we conclude that MusER yields a disentangled and interpretable latent space.
Experimental results demonstrate that MusER outperforms the state-of-the-art models for generating emotional music.
arXiv Detail & Related papers (2023-12-16T03:50:13Z)
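One simple way to realize "musical element-based regularization" of a latent space is sketched below: the latent code is split into named segments and each segment is trained, via a small auxiliary probe, to explain a single musical element. The element set, segment size, and probe losses are assumptions; the paper's actual formulation may differ.

```python
# Toy sketch of element-based latent regularization in the spirit of MusER:
# the latent code is split into named segments and each segment is pushed, via
# an auxiliary probe, to explain one musical element. The elements listed here
# (pitch, duration, velocity) are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

ELEMENTS = {"pitch": 0, "duration": 1, "velocity": 2}   # assumed element set
SEG = 16                                                 # latent dims per element

class ElementRegularizedLatent(nn.Module):
    def __init__(self, in_dim=64):
        super().__init__()
        self.encoder = nn.Linear(in_dim, SEG * len(ELEMENTS))
        # One small probe per element, attached to its own latent segment.
        self.probes = nn.ModuleDict({k: nn.Linear(SEG, 1) for k in ELEMENTS})

    def forward(self, x, targets):
        z = self.encoder(x)
        reg = 0.0
        for name, idx in ELEMENTS.items():
            seg = z[:, idx * SEG:(idx + 1) * SEG]
            reg = reg + F.mse_loss(self.probes[name](seg).squeeze(-1), targets[name])
        return z, reg

model = ElementRegularizedLatent()
x = torch.randn(4, 64)                                   # stand-in input features
targets = {k: torch.rand(4) for k in ELEMENTS}           # toy per-element targets
z, reg_loss = model(x, targets)
reg_loss.backward()
```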
- Multi-Branch Network for Imagery Emotion Prediction [4.618814297494939]
We present a novel Multi-Branch Network (MBN) to predict both discrete and continuous emotions in an image.
Our proposed method significantly outperforms state-of-the-art methods, achieving 28.4% mAP and 0.93 MAE.
arXiv Detail & Related papers (2023-12-12T18:34:56Z)
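A toy two-branch predictor in the spirit of the MBN entry above: a shared backbone feeds one branch for discrete emotion classification and another for continuous valence-arousal regression, trained jointly. The backbone, feature dimension, and class count are placeholders, not the paper's.

```python
# Minimal two-branch sketch: discrete emotion classification plus continuous
# valence-arousal regression from a shared backbone (all sizes illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchEmotionNet(nn.Module):
    def __init__(self, feat_dim=512, num_classes=8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.cls_branch = nn.Linear(256, num_classes)   # discrete emotions
        self.va_branch = nn.Linear(256, 2)              # valence, arousal

    def forward(self, x):
        h = self.backbone(x)
        return self.cls_branch(h), self.va_branch(h)

model = MultiBranchEmotionNet()
x = torch.randn(8, 512)                                  # stand-in image features
labels, va = torch.randint(0, 8, (8,)), torch.rand(8, 2)
logits, va_pred = model(x)
loss = F.cross_entropy(logits, labels) + F.mse_loss(va_pred, va)
loss.backward()
```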
- StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning [69.06749934902464]
We propose a style-guided high-order attention network for image emotion distribution learning, termed StyleEDL.
StyleEDL interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents.
In addition, we introduce a stylistic graph convolutional network to dynamically generate content-dependent emotion representations.
arXiv Detail & Related papers (2023-08-06T03:22:46Z)
- Contrastive Learning with Positive-Negative Frame Mask for Music Representation [91.44187939465948]
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.
arXiv Detail & Related papers (2022-03-17T07:11:42Z)
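A hedged sketch of the frame-mask contrastive idea in the entry above: two views of each clip are produced by masking random spectrogram frames, and an NT-Xent-style loss pulls views of the same clip together against other clips in the batch. The masking ratio, encoder, and loss details are assumptions rather than the paper's exact recipe.

```python
# Contrastive music representation learning with a random frame mask
# (loosely following the PEMR entry above; details are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

def frame_mask(spec, ratio=0.3):
    """Zero out a random subset of time frames. spec: (batch, frames, mels)."""
    keep = (torch.rand(spec.size(0), spec.size(1), 1) > ratio).float()
    return spec * keep

class ClipEncoder(nn.Module):
    def __init__(self, mels=64, emb=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mels, 128), nn.ReLU(), nn.Linear(128, emb))
    def forward(self, spec):
        return F.normalize(self.net(spec).mean(dim=1), dim=-1)  # pool over frames

def nt_xent(z1, z2, temp=0.1):
    """NT-Xent loss: the positive for view i of clip k is the other view of clip k."""
    z = torch.cat([z1, z2], dim=0)                       # (2N, emb)
    sim = z @ z.t() / temp
    sim = sim.masked_fill(torch.eye(z.size(0), dtype=torch.bool), float("-inf"))
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

enc = ClipEncoder()
spec = torch.randn(8, 100, 64)                           # toy log-mel batch
loss = nt_xent(enc(frame_mask(spec)), enc(frame_mask(spec)))
loss.backward()
```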
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
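A rough sketch of the two components named in the SOLVER entry above: a graph layer that propagates information between object nodes, and a scene-guided attention that pools object features under the scene context before classification. Graph construction, dimensions, and the fusion rule are simplified guesses, not the published model.

```python
# Simplified object-graph reasoning plus scene-guided attention pooling
# (a guess at the structure described above, not the published SOLVER).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectGraphLayer(nn.Module):
    """One graph-convolution step over object nodes: H' = ReLU(A_norm H W)."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim)
    def forward(self, h, adj):
        a = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalize
        return F.relu(a @ self.w(h))

class SceneObjectFusion(nn.Module):
    """Scene feature scores each object; objects are pooled by those scores."""
    def __init__(self, dim, num_emotions=8):
        super().__init__()
        self.graph = ObjectGraphLayer(dim)
        self.classifier = nn.Linear(2 * dim, num_emotions)
    def forward(self, objects, adj, scene):
        h = self.graph(objects, adj)                              # (batch, N, dim)
        scores = (h * scene.unsqueeze(1)).sum(-1) / h.size(-1) ** 0.5
        alpha = torch.softmax(scores, dim=-1)                     # scene-based attention
        pooled = (alpha.unsqueeze(-1) * h).sum(dim=1)             # attention pooling
        return self.classifier(torch.cat([pooled, scene], dim=-1))

model = SceneObjectFusion(dim=64)
objects, scene = torch.randn(2, 5, 64), torch.randn(2, 64)       # 5 objects per image
adj = torch.rand(2, 5, 5)                                         # toy similarity graph
logits = model(objects, adj, scene)
F.cross_entropy(logits, torch.randint(0, 8, (2,))).backward()
```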
- Comparison and Analysis of Deep Audio Embeddings for Music Emotion Recognition [1.6143012623830792]
We use state-of-the-art pre-trained deep audio embedding methods for the Music Emotion Recognition (MER) task.
Deep audio embeddings represent musical emotion semantics for the MER task without expert human engineering.
arXiv Detail & Related papers (2021-04-13T21:09:54Z)
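A small sketch of the workflow the entry above implies: pooled deep audio embeddings (extracted offline with a pre-trained model such as OpenL3 or VGGish) stand in for hand-crafted features, and a lightweight regressor maps them to valence and arousal. Random arrays replace real embeddings here, and no specific extraction API is assumed.

```python
# MER as simple regression on pre-computed deep audio embeddings.
# The embeddings are random stand-ins for real pre-trained features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))      # one pooled embedding per clip
va_labels = rng.uniform(size=(200, 2))        # (valence, arousal) annotations

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, va_labels,
                                          test_size=0.25, random_state=0)
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)        # lightweight emotion regressor
print("R^2 on held-out clips:", r2_score(y_te, reg.predict(X_te)))
```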
- Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle low-resource multimodal emotion recognition.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)
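A sketch of how emotion embeddings can transfer to unseen categories, as the entry above describes for the zero-shot setting: the model regresses fused multimodal features into a word-embedding space of emotion names, so an unseen emotion is recognized by nearest-neighbour search against its own word vector. The emotion vectors below are random stand-ins for real word embeddings (e.g. GloVe), and the architecture is not the paper's.

```python
# Zero-shot emotion recognition via a shared word-embedding space
# (random vectors stand in for real emotion word embeddings).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 50
emotion_vecs = {e: F.normalize(torch.randn(EMB), dim=0)
                for e in ["happy", "sad", "angry", "fearful", "surprised"]}
seen = ["happy", "sad", "angry"]                 # classes available at training time
unseen = ["fearful", "surprised"]                # zero-shot classes

projector = nn.Linear(300, EMB)                  # maps fused multimodal features

def predict(feat, candidates):
    """Return the candidate emotion whose word vector is closest to the projection."""
    z = F.normalize(projector(feat), dim=-1)
    sims = {e: torch.dot(z, emotion_vecs[e]).item() for e in candidates}
    return max(sims, key=sims.get)

feat = torch.randn(300)                          # stand-in fused utterance feature
print("seen-set prediction:  ", predict(feat, seen))
print("zero-shot prediction: ", predict(feat, unseen))
```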
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.