Soundini: Sound-Guided Diffusion for Natural Video Editing
- URL: http://arxiv.org/abs/2304.06818v1
- Date: Thu, 13 Apr 2023 20:56:53 GMT
- Title: Soundini: Sound-Guided Diffusion for Natural Video Editing
- Authors: Seung Hyun Lee, Sieun Kim, Innfarn Yoo, Feng Yang, Donghyeon Cho,
Youngseo Kim, Huiwen Chang, Jinkyu Kim, Sangpil Kim
- Abstract summary: We propose a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting.
Our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties.
- Score: 29.231939578629785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a method for adding sound-guided visual effects to specific
regions of videos in a zero-shot setting. Animating the appearance of the
visual effect is challenging because each frame of the edited video should have
visual changes while maintaining temporal consistency. Moreover, existing video
editing solutions focus on temporal consistency across frames, ignoring the
visual style variations over time, e.g., thunderstorm, wave, fire crackling. To
overcome this limitation, we utilize temporal sound features for the dynamic
style. Specifically, we guide denoising diffusion probabilistic models with an
audio latent representation in the audio-visual latent space. To the best of
our knowledge, our work is the first to explore sound-guided natural video
editing from various sound sources with sound-specialized properties, such as
intensity, timbre, and volume. Additionally, we design optical flow-based
guidance to generate temporally consistent video frames, capturing the
pixel-wise relationship between adjacent frames. Experimental results show that
our method outperforms existing video editing techniques, producing more
realistic visual effects that reflect the properties of sound. Please visit our
page: https://kuai-lab.github.io/soundini-gallery/.
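The abstract describes two guidance signals: an audio embedding in a shared audio-visual latent space that steers the denoising toward the sound's semantics, and an optical-flow term that ties each frame to its neighbor. As a rough, hypothetical illustration only (not the authors' implementation), the PyTorch-style sketch below shows how such gradients could be injected into a single reverse-diffusion step; denoiser, av_encoder, audio_emb, and flow are assumed placeholder components.
```python
# Hypothetical sketch of sound- and flow-guided diffusion sampling.
# None of these components come from the Soundini paper; they only
# illustrate the general idea of gradient guidance described above.
import torch
import torch.nn.functional as F

def guided_denoise_step(x_t, t, prev_frame, flow, denoiser, av_encoder,
                        audio_emb, sound_scale=1.0, flow_scale=1.0):
    """One reverse-diffusion step nudged by two guidance gradients.

    x_t:        noisy current frame, shape (1, 3, H, W)
    prev_frame: previously edited frame, used for temporal consistency
    flow:       backward optical flow (current -> previous), (1, 2, H, W)
    denoiser:   placeholder model predicting the clean frame from (x_t, t)
    av_encoder: placeholder image encoder into the audio-visual space
    audio_emb:  target audio embedding in that shared space
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t, t)  # predicted clean frame

    # Sound guidance: cosine similarity between the predicted frame's
    # embedding and the audio embedding in the shared latent space.
    img_emb = F.normalize(av_encoder(x0_pred), dim=-1)
    sound_score = (img_emb * F.normalize(audio_emb, dim=-1)).sum()

    # Flow guidance: warping the previous frame by the optical flow should
    # reproduce the current frame, enforcing pixel-wise temporal consistency.
    _, _, h, w = x0_pred.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x_t.device),
                            torch.linspace(-1, 1, w, device=x_t.device),
                            indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)        # (1, H, W, 2)
    norm_flow = torch.stack((flow[:, 0] / (w / 2),
                             flow[:, 1] / (h / 2)), dim=-1)       # (1, H, W, 2)
    warped_prev = F.grid_sample(prev_frame, base_grid + norm_flow,
                                align_corners=True)
    flow_score = -F.l1_loss(x0_pred, warped_prev)

    grad = torch.autograd.grad(sound_scale * sound_score
                               + flow_scale * flow_score, x_t)[0]
    # A real sampler would fold this gradient into the DDPM posterior mean;
    # here we simply nudge x_t along the guidance direction.
    return x_t.detach() + grad
```
In a complete sampler this gradient would be combined with the usual DDPM posterior update at every timestep; the sketch only conveys the shape of the two guidance terms mentioned in the abstract.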
Related papers
- Self-Supervised Audio-Visual Soundscape Stylization [22.734359700809126]
We manipulate input speech to sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene.
Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures.
We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
arXiv Detail & Related papers (2024-09-22T06:57:33Z) - AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z) - MagicProp: Diffusion-based Video Editing via Motion-aware Appearance
Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z) - An Initial Exploration: Learning to Generate Realistic Audio for Silent
Video [0.0]
We develop a framework that observes video in its natural sequence and generates realistic audio to accompany it.
Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs.
We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively.
arXiv Detail & Related papers (2023-08-23T20:08:56Z) - Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z) - Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Non-Rigid Neural Radiance Fields: Reconstruction and Novel View
Synthesis of a Dynamic Scene From Monocular Video [76.19076002661157]
Non-Rigid Neural Radiance Fields (NR-NeRF) is a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes.
We show that even a single consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views.
arXiv Detail & Related papers (2020-12-22T18:46:12Z) - Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds generated outside the camera's view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.