SoundBrush: Sound as a Brush for Visual Scene Editing
- URL: http://arxiv.org/abs/2501.00645v1
- Date: Tue, 31 Dec 2024 20:53:45 GMT
- Title: SoundBrush: Sound as a Brush for Visual Scene Editing
- Authors: Kim Sung-Bin, Kim Jun-Seong, Junseok Ko, Yewon Kim, Tae-Hyun Oh
- Abstract summary: SoundBrush is a model that uses sound as a brush to edit and manipulate visual scenes.
Our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation.
- Abstract: We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at https://soundbrush.github.io/.
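The central mechanism in the abstract — mapping audio features into the textual conditioning space of the LDM — can be pictured with a minimal sketch. This is an illustration only, not the authors' code; the module name, dimensions, and number of pseudo-tokens are all assumptions.

```python
# Minimal sketch (assumptions throughout, not SoundBrush's implementation):
# project a pooled audio feature into a sequence of pseudo text tokens that
# an LDM's cross-attention can consume in place of text embeddings.
import torch
import torch.nn as nn

class AudioToTextTokens(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, num_tokens=8):
        super().__init__()
        self.num_tokens, self.text_dim = num_tokens, text_dim
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim * num_tokens),
            nn.GELU(),
            nn.Linear(text_dim * num_tokens, text_dim * num_tokens),
        )

    def forward(self, audio_feat):            # audio_feat: (B, audio_dim)
        tokens = self.proj(audio_feat)        # (B, num_tokens * text_dim)
        return tokens.view(-1, self.num_tokens, self.text_dim)

audio_feat = torch.randn(2, 768)              # stand-in for an audio encoder output
mapper = AudioToTextTokens()
pseudo_tokens = mapper(audio_feat)            # (2, 8, 768), fed to the LDM
print(pseudo_tokens.shape)
```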
Related papers
- Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z)
- Self-Supervised Audio-Visual Soundscape Stylization [22.734359700809126]
We manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene.
Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures.
We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
arXiv Detail & Related papers (2024-09-22T06:57:33Z)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to ground video-to-audio generation faithfully in the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, an audio Magnitude Modulator module adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, an audio Frequency Fuser module ensures temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
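A magnitude-driven modulation can be pictured with a small sketch. This is an assumption-laden illustration, not AudioScenic's implementation: the per-frame RMS envelope and the linear blending of original and edited frames are invented for clarity.

```python
# Illustrative sketch (not AudioScenic's code): gate per-frame edit strength
# by the audio's short-time magnitude. Frame count and gating are assumptions.
import torch

def magnitude_envelope(waveform, num_frames):
    """Per-video-frame RMS magnitude of a mono waveform, normalized to [0, 1]."""
    chunks = torch.chunk(waveform, num_frames)
    rms = torch.stack([c.pow(2).mean().sqrt() for c in chunks])
    return rms / (rms.max() + 1e-8)

def modulate_edit(frames, edited_frames, waveform):
    """Blend original and edited frames with audio-magnitude weights."""
    w = magnitude_envelope(waveform, frames.shape[0]).view(-1, 1, 1, 1)
    return (1 - w) * frames + w * edited_frames

frames = torch.rand(16, 3, 64, 64)        # original video frames
edited = torch.rand(16, 3, 64, 64)        # frames after a scene edit
audio = torch.randn(16000)                # 1 s of mono audio at 16 kHz
out = modulate_edit(frames, edited, audio)
print(out.shape)                          # torch.Size([16, 3, 64, 64])
```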
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
- Soundini: Sound-Guided Diffusion for Natural Video Editing [29.231939578629785]
We propose a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting.
Our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties.
arXiv Detail & Related papers (2023-04-13T20:56:53Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that schedules the learning procedure of each component to associate the audio and visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
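As a rough, hypothetical illustration of this translate-then-generate pipeline (the encoder, translator, and stand-in generator below are all assumptions, not the paper's architecture):

```python
# Simplified sketch (assumptions throughout): regress visual latents from
# audio features, then decode them with a frozen pretrained image generator.
import torch
import torch.nn as nn

audio_encoder = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
translator = nn.Linear(256, 512)          # audio feature -> visual latent
generator = nn.Sequential(                # stand-in for a frozen generator
    nn.Linear(512, 3 * 64 * 64), nn.Tanh(),
)
for p in generator.parameters():          # generator stays frozen
    p.requires_grad = False

mel = torch.randn(4, 100, 128)            # batch of log-mel spectrograms
_, h = audio_encoder(mel)                 # h: (1, 4, 256)
z_pred = translator(h[-1])                # predicted visual latents: (4, 512)
image = generator(z_pred).view(4, 3, 64, 64)
# Training would align z_pred with latents of real paired images
# (e.g., a regression or contrastive alignment loss).
print(image.shape)
```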
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
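A minimal sketch of such a source-centric coordinate transformation might look as follows; the exact parameterization (relative position, distance, and view-to-source cosine) is an assumption for illustration, not the paper's definition.

```python
# Minimal sketch (assumed parameterization, not AV-NeRF's code): express a
# camera pose relative to the sound source so acoustic fields can be learned
# in a source-centric coordinate frame.
import torch
import torch.nn.functional as F

def source_centric_coords(cam_pos, view_dir, src_pos):
    """Return (unit relative position, distance, cosine between the view
    direction and the direction from camera to source)."""
    rel = cam_pos - src_pos                     # source placed at the origin
    dist = rel.norm(dim=-1, keepdim=True)
    to_src = F.normalize(src_pos - cam_pos, dim=-1)
    cos = (F.normalize(view_dir, dim=-1) * to_src).sum(-1, keepdim=True)
    return rel / (dist + 1e-8), dist, cos

cam = torch.tensor([[2.0, 0.0, 1.0]])
view = torch.tensor([[-1.0, 0.0, 0.0]])
src = torch.tensor([[0.0, 0.0, 1.0]])
print(source_centric_coords(cam, view, src))
```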
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on the machine translation metrics used for captioning evaluation.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Learning Visual Styles from Audio-Visual Associations [21.022027778790978]
We present a method for learning visual styles from unlabeled audio-visual data.
Our model learns to manipulate the texture of a scene to match a sound.
We show that audio can be an intuitive representation for manipulating images.
arXiv Detail & Related papers (2022-05-10T17:57:07Z)
- Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation [58.16911861917018]
We present a novel method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis.
Our model couples learnt scene-specific feature volumes with a scene-agnostic neural rendering network.
We demonstrate various scene manipulations, including mixing scenes, deforming objects and inserting objects into scenes, while still producing photo-realistic results.
arXiv Detail & Related papers (2022-04-22T17:57:00Z)
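A rough sketch of this decoupling, under assumed design choices (trilinear volume sampling plus a small shared MLP; not the authors' implementation):

```python
# Rough sketch (assumed design, not Control-NeRF's code): per-scene feature
# volumes queried at 3D points, decoded by one scene-agnostic MLP renderer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneVolume(nn.Module):
    def __init__(self, channels=16, res=32):
        super().__init__()
        self.volume = nn.Parameter(torch.randn(1, channels, res, res, res) * 0.01)

    def forward(self, pts):  # pts: (N, 3) in [-1, 1]
        grid = pts.view(1, -1, 1, 1, 3)                 # grid_sample layout
        feat = F.grid_sample(self.volume, grid, align_corners=True)
        return feat.view(self.volume.shape[1], -1).t()  # (N, C)

class SharedRenderer(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, 64), nn.ReLU(),
            nn.Linear(64, 4),                           # RGB + density
        )

    def forward(self, feat):
        return self.mlp(feat)

scene_a, scene_b = SceneVolume(), SceneVolume()
renderer = SharedRenderer()                             # shared across scenes
pts = torch.rand(128, 3) * 2 - 1
rgba_a = renderer(scene_a(pts))                         # (128, 4)
# Editing operates on the volumes, e.g., mixing two scenes' features:
rgba_mix = renderer(0.5 * scene_a(pts) + 0.5 * scene_b(pts))
print(rgba_a.shape, rgba_mix.shape)
```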
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.