MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
- URL: http://arxiv.org/abs/2511.07137v1
- Date: Mon, 10 Nov 2025 14:18:27 GMT
- Title: MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
- Authors: Shiqi Jiang, Tianyi Liang, Changbo Wang, Chenhui Li
- Abstract summary: Music-induced painting is a unique artistic practice in which visual artworks are created under the influence of music. We propose a novel framework for music-induced painting assessment that directly models perceptual coherence between music and visual art. We present MPJudge, a model that integrates music features into a visual encoder via a modulation-based fusion mechanism.
- Score: 25.063505095572093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music-induced painting is a unique artistic practice in which visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods rely primarily on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music-induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large-scale dataset of music-painting pairs annotated by domain experts for perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation-based fusion mechanism. To learn effectively from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music-relevant regions in paintings.
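The two training ingredients named in the abstract, modulation-based fusion and Direct Preference Optimization, can be sketched as follows. This is an illustrative reading only, not the authors' released implementation: the fusion is shown as FiLM-style conditioning, the preference objective as a Bradley-Terry-style loss on coherence scores, and all function names, weight matrices, and tensor shapes are assumptions.

```python
import numpy as np

def film_modulate(visual_feats, music_emb, W_gamma, W_beta):
    """FiLM-style fusion (hypothetical): the music embedding predicts a
    per-channel scale (gamma) and shift (beta) applied to visual features.
    visual_feats: (batch, positions, channels); music_emb: (batch, dim)."""
    gamma = music_emb @ W_gamma            # (batch, channels)
    beta = music_emb @ W_beta              # (batch, channels)
    # Broadcast the modulation over all spatial positions.
    return (1.0 + gamma[:, None, :]) * visual_feats + beta[:, None, :]

def preference_loss(score_pref, score_rej, ref_pref, ref_rej, beta=0.1):
    """DPO-flavoured pairwise loss (assumed form): push the model's margin
    between the preferred and rejected painting, relative to a reference
    model's margin, through -log sigmoid."""
    margin = beta * ((score_pref - ref_pref) - (score_rej - ref_rej))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

With zero modulation weights the fusion reduces to the identity, which is a common initialization choice for conditioning layers; the preference loss shrinks as the preferred painting's score rises above the rejected one's.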
Related papers
- A Study on the Data Distribution Gap in Music Emotion Recognition [7.281487567929003]
Music Emotion Recognition (MER) is a task deeply connected to human perception. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres. We address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations.
arXiv Detail & Related papers (2025-10-06T10:57:05Z)
- Emergence of Painting Ability via Recognition-Driven Evolution [49.666177849272856]
We present a model with a stroke branch and a palette branch that together simulate human-like painting. We quantify the efficiency of visual communication by measuring the recognition accuracy achieved with machine vision. Experimental results show that our model achieves superior performance in high-level recognition tasks.
arXiv Detail & Related papers (2025-01-09T04:37:31Z)
- Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings [10.302353984541497]
This research develops a model capable of generating music that resonates with the emotions depicted in visual arts.
Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music dataset.
Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data.
arXiv Detail & Related papers (2024-09-12T08:19:25Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Motif-Centric Representation Learning for Symbolic Music [5.781931021964343]
We learn the implicit relationship between motifs and their variations via representation learning.
A regularization-based method, VICReg, is adopted for pretraining, while contrastive learning is used for fine-tuning.
We visualize the acquired motif representations, offering an intuitive comprehension of the overall structure of a music piece.
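The VICReg pretraining mentioned above balances three terms: an invariance term pulling two views of the same piece together, a variance term keeping each embedding dimension spread out, and a covariance term decorrelating dimensions. A minimal numpy sketch of that loss, with the standard weightings but otherwise assumed details:

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg loss over two batches of embeddings z_a, z_b: (n, d)."""
    n, d = z_a.shape
    # Invariance: mean squared distance between the two views.
    inv = np.mean((z_a - z_b) ** 2)
    # Variance: hinge keeping each dimension's std above 1.
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance: penalize off-diagonal entries of the covariance matrix.
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d
    return (sim_w * inv
            + var_w * (var_term(z_a) + var_term(z_b))
            + cov_w * (cov_term(z_a) + cov_term(z_b)))
```

Unlike contrastive objectives, nothing here needs negative pairs, which is why VICReg suits pretraining before the contrastive fine-tuning stage.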
arXiv Detail & Related papers (2023-09-19T13:09:03Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Contrastive Learning with Positive-Negative Frame Mask for Music Representation [91.44187939465948]
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective that accommodates both the self-augmented positives and the negatives sampled from the same music.
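The masked-frame contrastive idea summarized above can be sketched with a frame mask that builds augmented views and a standard InfoNCE-style objective; PEMR's actual masking strategy and loss are more involved, so treat the helper names and the mask construction below as hypothetical.

```python
import numpy as np

def mask_frames(spectrogram, keep_mask):
    """Zero out masked time frames to form an augmented view (hypothetical
    helper). spectrogram: (frames, bins); keep_mask: (frames,) of 0/1."""
    return spectrogram * keep_mask[:, None]

def info_nce(anchor, positives, negatives, temp=0.1):
    """InfoNCE over views of the same clip: maximize the softmax mass
    assigned to the positive views against the negatives."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.array([cos(anchor, p) for p in positives]) / temp
    neg = np.array([cos(anchor, n) for n in negatives]) / temp
    logits = np.concatenate([pos, neg])
    return -np.log(np.sum(np.exp(pos)) / np.sum(np.exp(logits)))
```

The loss is small when the anchor embedding aligns with its positives and large when it aligns with the negatives instead.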
arXiv Detail & Related papers (2022-03-17T07:11:42Z)
- Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities [6.832341432995627]
Music emotion recognition is an important task in MIR (Music Information Retrieval) research.
One important step towards better models would be to understand what a model is actually learning from the data.
We show how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction.
arXiv Detail & Related papers (2021-06-14T22:49:19Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks Artist Identification, Music Genre Classification and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.