Generative Disco: Text-to-Video Generation for Music Visualization
- URL: http://arxiv.org/abs/2304.08551v2
- Date: Thu, 28 Sep 2023 16:14:54 GMT
- Title: Generative Disco: Text-to-Video Generation for Music Visualization
- Authors: Vivian Liu, Tao Long, Nathan Raw, Lydia Chilton
- Abstract summary: We introduce Generative Disco, a generative AI system that helps generate music visualizations with large language models and text-to-video generation.
The system helps users visualize music in intervals by finding prompts to describe the images that intervals start and end on and interpolating between them to the beat of the music.
We introduce design patterns for improving these generated videos: transitions, which express shifts in color, time, subject, or style, and holds, which help focus the video on subjects.
- Score: 9.53563436241774
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visuals can enhance our experience of music, owing to the way they can
amplify the emotions and messages conveyed within it. However, creating music
visualization is a complex, time-consuming, and resource-intensive process. We
introduce Generative Disco, a generative AI system that helps generate music
visualizations with large language models and text-to-video generation. The
system helps users visualize music in intervals by finding prompts to describe
the images that intervals start and end on and interpolating between them to
the beat of the music. We introduce design patterns for improving these
generated videos: transitions, which express shifts in color, time, subject, or
style, and holds, which help focus the video on subjects. A study with
professionals showed that transitions and holds were a highly expressive
framework that enabled them to build coherent visual narratives. We conclude on
the generalizability of these patterns and the potential of generated video for
creative professionals.
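The abstract describes interpolating between a start image and an end image for each interval, timed to the beat of the music. As a minimal illustrative sketch (not the authors' implementation), the idea can be expressed as piecewise-linear interpolation between two latent vectors whose progress is anchored to beat timestamps; all names here (`beat_synced_latents`, `z_start`, `z_end`) are hypothetical:

```python
import numpy as np

def beat_synced_latents(z_start, z_end, beat_times, interval, fps=24):
    """Interpolate between two latents so that visual change lands on beats.

    z_start, z_end: latent vectors for the interval's start and end images.
    beat_times: sorted beat timestamps (seconds) strictly inside the interval.
    interval: (t0, t1) bounds of the interval in seconds.
    Returns an array of shape (n_frames, latent_dim) of per-frame latents.
    """
    t0, t1 = interval
    n_frames = int(round((t1 - t0) * fps))
    frame_times = t0 + np.arange(n_frames) / fps
    # Anchor interpolation progress so the k-th beat reaches k/K of the
    # way from z_start to z_end: change accelerates or slows to hit beats.
    anchors_t = np.concatenate(([t0], np.asarray(beat_times), [t1]))
    anchors_a = np.linspace(0.0, 1.0, len(anchors_t))
    alpha = np.interp(frame_times, anchors_t, anchors_a)
    return (1 - alpha)[:, None] * z_start + alpha[:, None] * z_end

# Toy usage: one beat at 0.5 s inside a 1-second interval, 4 fps.
frames = beat_synced_latents(np.zeros(8), np.ones(8), [0.5], (0.0, 1.0), fps=4)
```

In practice the latents would come from a text-to-image model's embedding space and the beat times from a beat tracker; this sketch only shows the timing logic.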
Related papers
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [55.977597688114514]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z) - Video2Music: Suitable Music Generation from Videos using an Affective
Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music matched to a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - It's Time for Artistic Correspondence in Music and Video [32.31962546363909]
We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level.
We propose a self-supervised approach that learns this correspondence directly from data, without any need of human annotations.
Experiments show that this approach strongly outperforms alternatives that do not exploit the temporal context.
arXiv Detail & Related papers (2022-06-14T20:21:04Z) - Tr\"aumerAI: Dreaming Music with StyleGAN [2.578242050187029]
We propose a neural music visualizer directly mapping deep music embeddings to style embeddings of StyleGAN.
An annotator listened to 100 music clips, each 10 seconds long, and selected the image among StyleGAN-generated examples that best suited each clip.
The generated examples show that the mapping between audio and video achieves a certain level of intra-segment similarity and inter-segment dissimilarity.
arXiv Detail & Related papers (2021-02-09T07:04:22Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize a video of a person playing a certain instrument, guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with
Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
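The training scheme in the last entry, animating a still image from an audio clip and optimizing the generated video toward the real one, can be sketched in outline. This is a hypothetical stand-in, not the paper's architecture: `animate_still` is a made-up linear "generator", and a plain pixel-space L2 loss substitutes for whatever objective the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def animate_still(still, audio_feats, W):
    """Hypothetical generator: repeat a still image across frames and add
    an audio-conditioned per-frame offset (a linear map W of the audio
    features), standing in for a learned animation network."""
    offsets = audio_feats @ W                      # (frames, H*W)
    frames = still.ravel()[None, :] + offsets
    return frames.reshape(audio_feats.shape[0], *still.shape)

def recon_loss(generated, real):
    """Pixel-space L2 loss pulling generated frames toward the real video."""
    return float(np.mean((generated - real) ** 2))

# Toy shapes: 5 frames of 4x4 "video", 3-dimensional audio features.
still = rng.normal(size=(4, 4))
audio = rng.normal(size=(5, 3))
real = rng.normal(size=(5, 4, 4))
W = np.zeros((3, 16))                              # untrained map
loss = recon_loss(animate_still(still, audio, W), real)
```

Training would minimize this loss over the generator's parameters; the learned audio representations are then reused for downstream tasks such as emotion and speech recognition.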