TräumerAI: Dreaming Music with StyleGAN
- URL: http://arxiv.org/abs/2102.04680v1
- Date: Tue, 9 Feb 2021 07:04:22 GMT
- Title: TräumerAI: Dreaming Music with StyleGAN
- Authors: Dasaem Jeong and Seungheon Doh and Taegyun Kwon
- Abstract summary: We propose a neural music visualizer directly mapping deep music embeddings to style embeddings of StyleGAN.
An annotator listened to 100 music clips, each 10 seconds long, and selected an image that suits the music from among the StyleGAN-generated examples.
The generated examples show that the mapping between audio and video exhibits a certain degree of intra-segment similarity and inter-segment dissimilarity.
- Score: 2.578242050187029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this paper is to generate a visually appealing video that responds
to music with a neural network so that each frame of the video reflects the
musical characteristics of the corresponding audio clip. To achieve the goal,
we propose a neural music visualizer directly mapping deep music embeddings to
style embeddings of StyleGAN, named TräumerAI, which consists of a music
auto-tagging model using a short-chunk CNN and StyleGAN2 pre-trained on the WikiArt
dataset. Rather than establishing an objective metric between musical and
visual semantics, we manually labeled the pairs in a subjective manner. An
annotator listened to 100 music clips, each 10 seconds long, and selected an image
that suits the music from among the 200 StyleGAN-generated examples. Based on the
collected data, we trained a simple transfer function that converts an audio
embedding to a style embedding. The generated examples show that the mapping
between audio and video exhibits a certain degree of intra-segment similarity and
inter-segment dissimilarity.
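The abstract describes the transfer function only as "simple" and does not give its form. As a rough illustration of the idea, the sketch below fits an affine map from audio embeddings to style embeddings with ridge regression on a small set of annotated pairs. The choice of ridge regression, the function names `fit_transfer` and `apply_transfer`, the audio embedding width of 64, and the random stand-in data are all assumptions made for this example; 512 is the usual StyleGAN2 latent width, not a figure taken from the abstract.

```python
import numpy as np


def fit_transfer(audio_emb, style_emb, ridge=1e-2):
    """Fit an affine map audio -> style by closed-form ridge regression.

    audio_emb: (n_pairs, d_audio) array of audio embeddings.
    style_emb: (n_pairs, d_style) array of the style vectors of the chosen images.
    Returns W with shape (d_audio + 1, d_style); the last row is the bias.
    """
    n = audio_emb.shape[0]
    X = np.hstack([audio_emb, np.ones((n, 1))])  # append a bias column
    # Solve (X^T X + ridge * I) W = X^T Y; the penalty also covers the bias row.
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ style_emb)


def apply_transfer(W, audio_emb):
    """Map one or more audio embeddings to predicted style embeddings."""
    audio_emb = np.atleast_2d(audio_emb)
    X = np.hstack([audio_emb, np.ones((audio_emb.shape[0], 1))])
    return X @ W


# Hypothetical stand-in data: 100 annotated (clip, image) pairs.
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 64))    # assumed audio embedding width
style = rng.normal(size=(100, 512))   # assumed style embedding width
W = fit_transfer(audio, style)
pred = apply_transfer(W, audio[0])    # style vector for a single clip
```

In a setup like this, each predicted style vector would be passed to the generator to render one frame; with only 100 pairs, a regularized linear map is one way to limit overfitting, though the authors may have used a different model.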
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z) - VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [32.741262543860934]
We present a framework for learning to generate background music from video inputs.
We develop a generative video-music Transformer with a novel semantic video-music alignment scheme.
New temporal video encoder architecture allows us to efficiently process videos consisting of many densely sampled frames.
arXiv Detail & Related papers (2024-09-11T17:56:48Z) - MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z) - VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z) - Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z) - V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [47.076283429992664]
V2Meow is a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types.
It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames.
arXiv Detail & Related papers (2023-05-11T06:26:41Z) - MusCaps: Generating Captions for Music Audio [14.335950077921435]
We present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention.
Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs.
Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding.
arXiv Detail & Related papers (2021-04-24T16:34:47Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)