SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis
- URL: http://arxiv.org/abs/2310.15247v1
- Date: Mon, 23 Oct 2023 18:01:36 GMT
- Title: SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis
- Authors: Marco Comunità, Riccardo F. Gramaccioni, Emilian Postolache, Emanuele Rodolà, Danilo Comminiello, Joshua D. Reiss
- Abstract summary: One of the most time-consuming steps when designing sound is synchronizing audio with video.
In video games and animations, no reference audio exists, requiring manual annotation of event timings from the video.
We propose a system to extract repetitive action onsets from a video, which are then used to condition a diffusion model trained to generate a new synchronized sound effects audio track.
- Score: 9.118448725265669
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound design involves creatively selecting, recording, and editing sound
effects for various media like cinema, video games, and virtual/augmented
reality. One of the most time-consuming steps when designing sound is
synchronizing audio with video. In some cases, environmental recordings from
video shoots are available, which can aid in the process. However, in video
games and animations, no reference audio exists, requiring manual annotation of
event timings from the video. We propose a system to extract repetitive action
onsets from a video, which are then used, in conjunction with audio or textual
embeddings, to condition a diffusion model trained to generate a new
synchronized sound effects audio track. In this way, we leave complete creative
control to the sound designer while removing the burden of synchronization with
video. Furthermore, editing the onset track or changing the conditioning
embedding requires much less effort than editing the audio track itself,
simplifying the sonification process. We provide sound examples, source code,
and pretrained models to facilitate reproducibility.
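To make the described pipeline concrete, below is a minimal sketch (Python/NumPy) of the two ingredients the abstract outlines: detecting repetitive action onsets from video frames and rendering them as an impulse track that, together with an audio or text embedding, could condition a diffusion sampler. The frame-difference motion proxy and the `FoleyDiffusion` class with its `sample` interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an onset-conditioned Foley pipeline; NOT the authors' code.
# Assumptions (hypothetical): onsets are approximated by peaks in frame-difference
# energy, and `FoleyDiffusion` stands in for a pretrained diffusion model that
# accepts an onset track plus an audio/text embedding as conditioning.
import numpy as np

def detect_action_onsets(frames: np.ndarray, fps: float, threshold: float = 0.5):
    """Return onset times (seconds) from a (T, H, W) grayscale video array.

    Uses mean absolute frame difference as a crude motion-energy proxy and
    picks local peaks that exceed `threshold` after min-max normalisation.
    """
    motion = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    motion = (motion - motion.min()) / (motion.max() - motion.min() + 1e-8)
    onsets = []
    for t in range(1, len(motion) - 1):
        if motion[t] > threshold and motion[t] >= motion[t - 1] and motion[t] > motion[t + 1]:
            onsets.append((t + 1) / fps)  # diff index t refers to frame t+1
    return onsets

def onsets_to_track(onsets, duration_s: float, sample_rate: int = 22050):
    """Render onset times as an impulse track aligned with the audio timeline."""
    track = np.zeros(int(duration_s * sample_rate), dtype=np.float32)
    for t in onsets:
        idx = int(t * sample_rate)
        if idx < len(track):
            track[idx] = 1.0
    return track

# Hypothetical usage: the sound designer picks an audio or text embedding
# (e.g. a CLAP-style vector) and the diffusion model generates audio whose
# events line up with the onset track.
# audio = FoleyDiffusion.from_pretrained("syncfusion").sample(track, embedding)
```

Editing the onset track (shifting, adding, or deleting impulses) or swapping the conditioning embedding then only requires changing these lightweight inputs rather than the generated audio itself.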
Related papers
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos [3.6078215038168473]
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality and assistive technologies, or for augmenting existing datasets.
arXiv Detail & Related papers (2024-07-30T06:57:00Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, an audio Magnitude Modulator module adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, an audio Frequency Fuser module ensures temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Conditional Generation of Audio from Video via Foley Analogies [19.681437827280757]
Sound effects that designers add to videos are designed to convey a particular artistic effect and may be quite different from a scene's true sound.
Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, we propose the problem of conditional Foley.
We show through human studies and automated evaluation metrics that our model successfully generates sound from video.
arXiv Detail & Related papers (2023-04-17T17:59:45Z)
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
- Soundify: Matching Sound Effects to Video [4.225919537333002]
This paper presents Soundify, a system that assists editors in matching sounds to video.
Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio.
arXiv Detail & Related papers (2021-12-17T19:22:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences.