The Power of Sound (TPoS): Audio Reactive Video Generation with Stable
Diffusion
- URL: http://arxiv.org/abs/2309.04509v1
- Date: Fri, 8 Sep 2023 12:21:01 GMT
- Title: The Power of Sound (TPoS): Audio Reactive Video Generation with Stable
Diffusion
- Authors: Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon,
Sangpil Kim and Jinkyu Kim
- Abstract summary: We propose The Power of Sound model to incorporate audio input that includes both changeable temporal semantics and magnitude.
To generate video frames, TPoS utilizes a latent stable diffusion model with semantic information, which is then guided by the sequential audio embedding.
We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation.
- Score: 23.398304611826642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, video generation has become a prominent generative tool and
has drawn significant attention. However, audio-to-video generation has received
little consideration, even though audio carries unique qualities such as temporal
semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to
incorporate audio input that includes both changeable temporal semantics and
magnitude. To generate video frames, TPoS utilizes a latent stable diffusion
model with textual semantic information, which is then guided by the sequential
audio embedding from our pretrained Audio Encoder. As a result, this method
produces audio-reactive video content. We demonstrate the effectiveness of
TPoS across various tasks and compare its results with current state-of-the-art
techniques in the field of audio-to-video generation. More examples are
available at https://ku-vai.github.io/TPoS/
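As a rough illustration of the pipeline the abstract describes, the sketch below shows how a sequential audio embedding could steer a text-conditioned latent denoising loop, producing one latent video frame per audio step. Every module, shape, and update rule here (AudioEncoder, ToyDenoiser, generate_frames) is an assumption for illustration only, not the authors' implementation or a real Stable Diffusion scheduler.

    # Illustrative sketch (PyTorch): a sequential audio embedding guiding a
    # text-conditioned latent denoising loop, one video-frame latent per audio step.
    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """Hypothetical stand-in: maps a log-mel sequence to per-frame embeddings."""
        def __init__(self, n_mels=80, dim=768):
            super().__init__()
            self.rnn = nn.LSTM(n_mels, dim, batch_first=True)

        def forward(self, mel):                          # mel: (B, T, n_mels)
            seq, _ = self.rnn(mel)
            return seq                                   # (B, T, dim) sequential audio embedding

    class ToyDenoiser(nn.Module):
        """Crude stand-in for a cross-attention UNet; not Stable Diffusion."""
        def __init__(self, dim=768, latent_ch=4):
            super().__init__()
            self.proj = nn.Linear(dim, latent_ch)

        def forward(self, z, step, cond):                # z: (B, 4, H, W), cond: (B, L, dim)
            scale = self.proj(cond.mean(dim=1))          # pool the condition tokens
            return 1e-3 * scale[:, :, None, None] + 0.0 * z

    def generate_frames(denoiser, text_emb, audio_seq, steps=10, latent_shape=(1, 4, 64, 64)):
        """Concatenate the text tokens with one audio token per time step and denoise."""
        frames = []
        for t in range(audio_seq.size(1)):
            cond = torch.cat([text_emb, audio_seq[:, t:t + 1]], dim=1)
            z = torch.randn(latent_shape)
            for step in reversed(range(steps)):
                z = z - denoiser(z, step, cond) / steps  # placeholder update, not a real sampler
            frames.append(z)
        return torch.stack(frames, dim=1)                # (B, T, 4, 64, 64) latent video

    audio_seq = AudioEncoder()(torch.randn(1, 16, 80))   # 16 audio frames
    text_emb = torch.randn(1, 77, 768)                   # stand-in for a CLIP text embedding
    print(generate_frames(ToyDenoiser(), text_emb, audio_seq).shape)  # torch.Size([1, 16, 4, 64, 64])

In the actual TPoS pipeline the denoiser would be a pretrained Stable Diffusion UNet with a proper noise scheduler; the toy update above only mirrors the control flow of conditioning each frame on text plus the audio embedding at that time step.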
Related papers
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts (a minimal RMS-envelope sketch follows this entry).
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
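A minimal NumPy sketch of the RMS temporal condition Video-Foley relies on: a frame-wise root-mean-square envelope of the waveform, which rises and falls with sound events. The frame and hop sizes below are arbitrary assumptions, and this is not the paper's feature extractor.

    # Sketch: frame-wise RMS envelope of a waveform, the kind of temporal event
    # signal a video-to-sound model can be conditioned on.
    import numpy as np

    def rms_envelope(wav, frame_len=1024, hop=256):
        """One value per frame: sqrt(mean(x^2)) over a sliding window."""
        n_frames = 1 + max(0, len(wav) - frame_len) // hop
        rms = np.empty(n_frames)
        for i in range(n_frames):
            frame = wav[i * hop: i * hop + frame_len]
            rms[i] = np.sqrt(np.mean(frame ** 2))
        return rms

    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    wav = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)   # a decaying tone as a simple "event"
    print(rms_envelope(wav).shape)                        # (59,) for one second at 16 kHz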
- EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos [3.6078215038168473]
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality and assistive technologies, or be used to augment existing datasets.
arXiv Detail & Related papers (2024-07-30T06:57:00Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content creation in the music and film industry.
We hypothesize that explicitly focusing on the presence and temporal ordering of the concepts described in the prompt can improve audio generation performance when data is limited.
We synthetically create a preference dataset in which each prompt has one winner audio output and several loser audio outputs for the diffusion model to learn from (a minimal DPO-style loss is sketched after this entry).
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
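The DPO-style alignment step in Tango 2 can be summarized, in very reduced form, by the standard preference loss below. The scalar "scores" stand in for per-sample log-likelihood surrogates (in the diffusion setting these are derived from denoising losses); this is a generic formulation, not the paper's training code.

    # Sketch of a DPO-style preference loss over winner/loser audio samples.
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
        """policy_*/ref_*: (B,) log-likelihood surrogates under the trained and frozen models."""
        margin = (policy_win - ref_win) - (policy_lose - ref_lose)
        return -F.logsigmoid(beta * margin).mean()   # push the policy toward the winner

    scores = {k: torch.randn(8) for k in ("policy_win", "policy_lose", "ref_win", "ref_lose")}
    print(dpo_loss(**scores))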
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs (a toy decoding loop is sketched after this entry).
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
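A toy version of the auto-regressive decoding pattern behind AudioGen-style models: a small transformer predicts the next discrete audio token given the previous tokens and a text embedding. The vocabulary size, conditioning scheme, and all hyperparameters here are illustrative assumptions, not the paper's architecture.

    # Sketch: auto-regressive generation of discrete audio tokens conditioned on text.
    import torch
    import torch.nn as nn

    class ToyAudioLM(nn.Module):
        def __init__(self, vocab=1024, dim=256, text_dim=256):
            super().__init__()
            self.tok = nn.Embedding(vocab, dim)
            self.text_proj = nn.Linear(text_dim, dim)
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, vocab)

        def forward(self, tokens, text_emb):          # tokens: (B, T), text_emb: (B, text_dim)
            x = self.tok(tokens) + self.text_proj(text_emb)[:, None, :]
            causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            return self.head(self.backbone(x, mask=causal))   # (B, T, vocab) next-token logits

    @torch.no_grad()
    def sample(model, text_emb, steps=20):
        tokens = torch.zeros(text_emb.size(0), 1, dtype=torch.long)   # start token = 0
        for _ in range(steps):
            logits = model(tokens, text_emb)[:, -1]
            nxt = torch.multinomial(logits.softmax(-1), 1)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens                                  # codes a neural codec would decode to audio

    codes = sample(ToyAudioLM(), torch.randn(1, 256))
    print(codes.shape)                                 # torch.Size([1, 21])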
- Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework that is trained to learn a per-frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in generated video quality (a minimal sketch of such a per-frame conditional prior follows this entry).
arXiv Detail & Related papers (2020-07-23T16:57:44Z)
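Sound2Sight's per-frame prior can be pictured as a small conditional Gaussian network: given a joint embedding of the audio and the past frames, it outputs a mean and variance from which the latent for the next frame is sampled. The sizes and the concatenation-based "joint embedding" below are assumptions for illustration, not the authors' architecture.

    # Sketch: a per-frame conditional Gaussian prior over the next-frame latent.
    import torch
    import torch.nn as nn

    class FramePrior(nn.Module):
        def __init__(self, cond_dim=512, z_dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(cond_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * z_dim))

        def forward(self, joint_emb):                  # joint_emb: (B, cond_dim)
            mu, logvar = self.net(joint_emb).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
            return z, mu, logvar                       # z would seed the next-frame decoder

    audio_emb, past_emb = torch.randn(2, 256), torch.randn(2, 256)
    z, mu, logvar = FramePrior()(torch.cat([audio_emb, past_emb], dim=-1))
    print(z.shape)                                     # torch.Size([2, 64])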