The Power of Sound (TPoS): Audio Reactive Video Generation with Stable
  Diffusion
- URL: http://arxiv.org/abs/2309.04509v1
- Date: Fri, 8 Sep 2023 12:21:01 GMT
- Title: The Power of Sound (TPoS): Audio Reactive Video Generation with Stable
  Diffusion
- Authors: Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon,
  Sangpil Kim and Jinkyu Kim
- Abstract summary: We propose The Power of Sound model to incorporate audio input that includes both changeable temporal semantics and magnitude.
To generate video frames, TPoS utilizes a latent stable diffusion model with semantic information, which is then guided by the sequential audio embedding.
We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation.
- Score: 23.398304611826642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, video generation has become a prominent generative tool and
has drawn significant attention. However, there is little consideration in
audio-to-video generation, though audio contains unique qualities like temporal
semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to
incorporate audio input that includes both changeable temporal semantics and
magnitude. To generate video frames, TPoS utilizes a latent stable diffusion
model with textual semantic information, which is then guided by the sequential
audio embedding from our pretrained Audio Encoder. As a result, this method
produces audio reactive video contents. We demonstrate the effectiveness of
TPoS across various tasks and compare its results with current state-of-the-art
techniques in the field of audio-to-video generation. More examples are
available at https://ku-vai.github.io/TPoS/
 
      
Related papers
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
 ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos.
Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions.
Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
 arXiv  Detail & Related papers  (2025-06-26T16:32:06Z)
- Seeing Voices: Generating A-Roll Video from Audio with Mirage [12.16029287095035]
 Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation.
We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input.
 arXiv  Detail & Related papers  (2025-06-09T22:56:02Z)
- Audio-Sync Video Generation with Multi-Stream Temporal Control [64.00019697525322]
 We introduce MTV, a versatile framework for video generation with precise audio-visual synchronization.
MTV separates audios into speech, effects, and tracks, enabling control over lip motion, event timing, and visual mood.
To support the framework, we additionally present DEmix, a dataset of high-quality cinematic videos and demixed audio tracks.
 arXiv  Detail & Related papers  (2025-06-09T17:59:42Z)
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
 Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts.
 arXiv  Detail & Related papers  (2024-08-21T18:06:15Z)
- EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos [3.6078215038168473]
 We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets.
 arXiv  Detail & Related papers  (2024-07-30T06:57:00Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
 We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
 arXiv  Detail & Related papers  (2024-07-08T01:59:17Z)
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
 We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
 arXiv  Detail & Related papers  (2024-07-01T17:35:56Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
 Generation of audio from text prompts is an important aspect of such processes in the music and film industry.
Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
 arXiv  Detail & Related papers  (2024-04-15T17:31:22Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
  Latent Aligners [69.70590867769408]
 Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
 arXiv  Detail & Related papers  (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
  Adaptation [89.96013329530484]
 We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
 arXiv  Detail & Related papers  (2023-09-28T13:26:26Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
 We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
 arXiv  Detail & Related papers  (2023-06-21T20:54:52Z)
- VarietySound: Timbre-Controllable Video to Sound Generation via
  Unsupervised Information Disentanglement [68.42632589736881]
 We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
 arXiv  Detail & Related papers  (2022-11-19T11:12:01Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
 We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
 arXiv  Detail & Related papers  (2022-09-30T10:17:05Z)
- Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
 We present Sound2Sight, a deep variational framework, that is trained to learn a per frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
 arXiv  Detail & Related papers  (2020-07-23T16:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.