Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation
- URL: http://arxiv.org/abs/2211.05543v1
- Date: Thu, 10 Nov 2022 13:01:26 GMT
- Title: Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation
- Authors: Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia
- Abstract summary: We explore the representation mapping from the domain of visual arts to the domain of music.
We adopt an analysis-by-synthesis approach that combines deep music representation learning with user studies.
We release the Vis2Mus system as a controllable interface for symbolic music generation.
- Score: 11.140337453072311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we explore the representation mapping from the domain of
visual arts to the domain of music, with which we can use visual arts as an
effective handle to control music generation. Unlike most studies in multimodal
representation learning that are purely data-driven, we adopt an
analysis-by-synthesis approach that combines deep music representation learning
with user studies. Such an approach enables us to discover
interpretable representation mapping without a huge amount of paired
data. In particular, we discover that the visual-to-music mapping has a nice
property similar to equivariance. In other words, we can use various image
transformations, such as changing brightness, changing contrast, or style
transfer, to control the corresponding transformations in the music domain.
In addition, we release the Vis2Mus system as a controllable interface for
symbolic music generation.
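To make this kind of equivariant control concrete, the sketch below shows how an image-domain edit (here a brightness increase) could be encoded as a latent shift and propagated to a music-representation vector. It is a minimal illustration with placeholder linear maps; the names and matrices (encode_image, latent_to_music, W_img, W_map) are hypothetical stand-ins and not the actual Vis2Mus architecture or API.

```python
import numpy as np

# Placeholder linear "encoders"; the real system learns these mappings.
rng = np.random.default_rng(0)
IMG_DIM, MUS_DIM, LATENT_DIM = 64 * 64, 128, 16
W_img = rng.normal(size=(LATENT_DIM, IMG_DIM)) / np.sqrt(IMG_DIM)      # image -> shared latent
W_map = rng.normal(size=(MUS_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)   # latent -> music representation

def encode_image(img: np.ndarray) -> np.ndarray:
    """Project a flattened grayscale image into the shared latent space."""
    return W_img @ img.ravel()

def latent_to_music(z: np.ndarray) -> np.ndarray:
    """Map a latent vector to a symbolic-music representation vector."""
    return W_map @ z

# Source image and its mapped music representation.
image = rng.uniform(size=(64, 64))
music_rep = latent_to_music(encode_image(image))

# Apply an image-domain transformation and propagate the latent shift.
# If the mapping behaves (approximately) equivariantly, an edit in the
# image domain induces a consistent, interpretable edit in the music domain.
brighter = np.clip(image + 0.2, 0.0, 1.0)
delta_z = encode_image(brighter) - encode_image(image)
edited_music_rep = music_rep + W_map @ delta_z

print(edited_music_rep.shape)  # (128,), to be decoded into symbolic music
```

The same pattern would apply to other image edits mentioned in the abstract (contrast changes, style transfer): compute the latent shift induced by the edit and apply the mapped shift to the music representation.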
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings [10.302353984541497]
This research develops a model capable of generating music that resonates with the emotions depicted in visual arts.
Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music dataset.
Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data.
arXiv Detail & Related papers (2024-09-12T08:19:25Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Towards Contrastive Learning in Music Video Domain [46.29203572184694]
We create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss (a minimal sketch of such a loss appears after this list).
For the experiments, we use an industry dataset containing 550,000 music videos as well as the public Million Song Dataset.
Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks.
arXiv Detail & Related papers (2023-09-01T09:08:21Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
- Contrastive Learning with Positive-Negative Frame Mask for Music Representation [91.44187939465948]
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.
arXiv Detail & Related papers (2022-03-17T07:11:42Z)
- Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts [11.96629917390208]
Music-to-visual style transfer is a challenging yet important cross-modal learning problem in the practice of creativity.
We solve the music-to-visual style transfer problem in two steps: music visualization and style transfer.
Experiments are conducted on WikiArt-IMSLP, a dataset including Western music recordings and paintings listed by decades.
arXiv Detail & Related papers (2020-09-17T05:58:13Z)
- Embeddings as representation for symbolic music [0.0]
A representation technique that allows encoding music in a way that contains musical meaning would improve the results of any model trained for computer music tasks.
In this paper, we experiment with embeddings to represent musical notes from 3 different variations of a dataset and analyze if the model can capture useful musical patterns.
arXiv Detail & Related papers (2020-05-19T13:04:02Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
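As referenced in the contrastive music-video entry above, a bidirectional contrastive loss scores matched audio/video pairs against in-batch negatives in both directions. The sketch below is a generic symmetric InfoNCE formulation in NumPy, assuming row-aligned embedding batches from the two encoders; it illustrates the general technique only and is not that paper's implementation.

```python
import numpy as np

def bidirectional_contrastive_loss(audio_emb: np.ndarray,
                                   video_emb: np.ndarray,
                                   temperature: float = 0.07) -> float:
    """Symmetric (bidirectional) InfoNCE over paired audio/video embeddings.

    Row i of each matrix is assumed to come from the same music video, so the
    diagonal of the similarity matrix holds the positive pairs and every other
    entry serves as an in-batch negative.
    """
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                      # (batch, batch)

    def direction_loss(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diagonal(log_probs).mean()           # positives on the diagonal

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (direction_loss(logits) + direction_loss(logits.T))

# Toy usage: random embeddings standing in for the two encoders' outputs.
rng = np.random.default_rng(0)
print(bidirectional_contrastive_loss(rng.normal(size=(8, 32)),
                                     rng.normal(size=(8, 32))))
```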