Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
- URL: http://arxiv.org/abs/2509.22378v1
- Date: Fri, 26 Sep 2025 14:07:29 GMT
- Title: Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
- Authors: Zijian Zhao, Dian Jin, Zijing Zhou
- Abstract summary: We propose the first Vision Language Model (VLM)-based Image-to-Music (I2M) framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training.
- Score: 2.894273842598117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generated results. Even methods based on emotion mapping face controversy, as emotion captures only a single aspect of art. Additionally, most learning-based methods require substantial computational resources and large training datasets, which limits accessibility for ordinary users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated textual motivations and the attention maps from the VLM to provide explanations for the generated results in both the text and image modalities. To validate our method, we conduct both human studies and machine evaluations, in which our method outperforms competing approaches in music quality and music-image consistency. Our code is available at https://github.com/RS2002/Image2Music.
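As a rough illustration of the pipeline the abstract describes (multi-modal RAG over image-music examples, ABC notation as the text-music bridge, and self-refinement), the sketch below wires the steps together in Python. The function names, prompts, and toy keyword-overlap retrieval are illustrative assumptions, not the authors' released code; see the linked repository for the actual implementation.

```python
# Minimal sketch, assuming a caller supplies a query_vlm() callable and a small
# corpus of (image caption, ABC score) pairs. Not the authors' code.
from typing import Callable, List, Tuple


def retrieve_examples(image_caption: str,
                      corpus: List[Tuple[str, str]],
                      k: int = 2) -> List[Tuple[str, str]]:
    """Toy stand-in for multi-modal RAG: rank (caption, ABC score) pairs by word overlap."""
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    return sorted(corpus, key=lambda pair: overlap(image_caption, pair[0]), reverse=True)[:k]


def generate_music(image_caption: str,
                   query_vlm: Callable[[str], str],
                   corpus: List[Tuple[str, str]],
                   refine_rounds: int = 2) -> str:
    """Prompt a VLM for ABC notation, then let it self-refine its own output."""
    examples = retrieve_examples(image_caption, corpus)
    shots = "\n\n".join(f"Image: {cap}\nABC:\n{abc}" for cap, abc in examples)
    prompt = (f"{shots}\n\nImage: {image_caption}\n"
              "Explain your musical motivation, then output a piece in ABC notation.")
    abc = query_vlm(prompt)
    for _ in range(refine_rounds):
        abc = query_vlm("Check this ABC notation for syntax errors and poor "
                        f"fit with the image, then output an improved version:\n{abc}")
    return abc
```

Retrieval here is a keyword-overlap stand-in; a faithful multi-modal RAG would instead rank examples with image and music embeddings.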
Related papers
- Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment [8.468469176803241]
We introduce ArtSound, a large-scale dataset of 105,884 artwork-music pairs enriched with dual-modality captions. We propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation. ArtToMus maps digitized artworks to music without image-to-text translation or language-based semantic supervision.
arXiv Detail & Related papers (2026-02-19T18:23:58Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings [10.302353984541497]
This research develops a model capable of generating music that resonates with the emotions depicted in visual arts.
Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music dataset.
Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data.
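A minimal sketch of the dual-stage idea summarized above, image to emotion-focused text and text to music; the two model interfaces are hypothetical placeholders, not the paper's implementation.

```python
# Illustrative two-stage pipeline; describe_emotion and text_to_music stand in
# for whatever captioning and music-generation models a user plugs in.
from typing import Callable


def image_to_music(image_path: str,
                   describe_emotion: Callable[[str], str],
                   text_to_music: Callable[[str], bytes]) -> bytes:
    """Stage 1: describe the painting's emotional content; Stage 2: render music from that text."""
    emotion_text = describe_emotion(image_path)   # e.g. "melancholic, muted blues, slow tempo"
    return text_to_music(emotion_text)            # audio bytes from a text-to-music model
```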
arXiv Detail & Related papers (2024-09-12T08:19:25Z) - MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z) - Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models [9.311353871322325]
Mozart's Touch is a framework capable of generating music aligned with cross-modal inputs such as images, videos, and text.
Unlike traditional end-to-end methods, Mozart's Touch uses LLMs to accurately interpret visual elements without requiring the training or fine-tuning of music generation models.
arXiv Detail & Related papers (2024-05-05T03:15:52Z) - Towards Contrastive Learning in Music Video Domain [46.29203572184694]
We create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss.
For the experiments, we use an industry dataset containing 550,000 music videos as well as the public Million Song Dataset.
Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks.
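For reference, a bidirectional contrastive loss over a dual encoder, as described in this entry, typically takes the symmetric InfoNCE form sketched below in PyTorch; the embedding shapes and temperature are assumptions about the general recipe, not the paper's exact objective.

```python
# Symmetric (audio->video and video->audio) contrastive loss; matched pairs
# share a row index in the batch. Temperature 0.07 is an illustrative value.
import torch
import torch.nn.functional as F


def bidirectional_contrastive_loss(audio_emb: torch.Tensor,
                                   video_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (batch, dim) embeddings from the two encoders."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2v = F.cross_entropy(logits, targets)               # audio -> video direction
    loss_v2a = F.cross_entropy(logits.t(), targets)           # video -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)
```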
arXiv Detail & Related papers (2023-09-01T09:08:21Z) - Tackling Vision Language Tasks Through Learning Inner Monologues [10.795616787372625]
We propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems.
IMMO simulates inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves.
The results suggest IMMO can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models.
arXiv Detail & Related papers (2023-08-19T10:10:49Z) - BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
A key contribution is an off-the-shelf visual grounding module based on SAM that extracts entities from a sentence and finds the corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modal understanding and visual grounding abilities during interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - Contrastive Learning with Positive-Negative Frame Mask for Music Representation [91.44187939465948]
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective that accommodates both the positives and the negatives self-augmented from the same piece of music.
arXiv Detail & Related papers (2022-03-17T07:11:42Z) - Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
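A hedged sketch of the momentum-distillation idea summarized above: an exponential-moving-average (EMA) copy of the model supplies soft pseudo-targets that are mixed with the hard-label loss. The hyperparameter values and function names are illustrative, not ALBEF's actual implementation.

```python
# The momentum model starts as a frozen deep copy of the student and is updated
# by EMA; its softmax outputs act as pseudo-targets for a KL regularizer.
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(momentum_model: torch.nn.Module, model: torch.nn.Module, m: float = 0.995) -> None:
    """Update the momentum (teacher) copy as an exponential moving average of the student."""
    for p_m, p in zip(momentum_model.parameters(), model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)


def distillation_loss(student_logits: torch.Tensor,
                      momentum_logits: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.4) -> torch.Tensor:
    """Mix the hard-label loss with KL divergence toward the momentum model's pseudo-targets."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(momentum_logits.detach(), dim=-1),
                    reduction="batchmean")
    return (1.0 - alpha) * hard + alpha * soft
```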
arXiv Detail & Related papers (2021-07-16T00:19:22Z) - Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach that synthesizes a talking-head video of arbitrary length from two inputs: an audio signal and a single unseen image of the person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)