Conditional Generation of Audio from Video via Foley Analogies
- URL: http://arxiv.org/abs/2304.08490v1
- Date: Mon, 17 Apr 2023 17:59:45 GMT
- Title: Conditional Generation of Audio from Video via Foley Analogies
- Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell and Andrew Owens
- Abstract summary: Sound effects that designers add to videos are designed to convey a particular artistic effect and may be quite different from a scene's true sound.
Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, we propose the problem of conditional Foley.
We show through human studies and automated evaluation metrics that our model successfully generates sound from video.
- Score: 19.681437827280757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sound effects that designers add to videos are designed to convey a
particular artistic effect and, thus, may be quite different from a scene's
true sound. Inspired by the challenges of creating a soundtrack for a video
that differs from its true sound, but that nonetheless matches the actions
occurring on screen, we propose the problem of conditional Foley. We present
the following contributions to address this problem. First, we propose a
pretext task for training our model to predict sound for an input video clip
using a conditional audio-visual clip sampled from another time within the same
source video. Second, we propose a model for generating a soundtrack for a
silent input video, given a user-supplied example that specifies what the video
should "sound like". We show through human studies and automated evaluation
metrics that our model successfully generates sound from video, while varying
its output according to the content of a supplied example. Project site:
https://xypb.github.io/CondFoleyGen/
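The pretext task described above boils down to a data-sampling rule: for each training example, cut an input clip and a conditional audio-visual clip from two different moments of the same source video, so the model must transfer the conditional clip's sound character onto the input clip's on-screen actions. The sketch below is a minimal illustration of that sampling under assumed clip lengths and representations; helper names such as sample_pretext_pair are hypothetical and are not taken from the authors' code.

```python
import random

# Minimal sketch of the pretext-task sampling described in the abstract:
# the model predicts audio for an input clip, conditioned on an audio-visual
# clip taken from a *different* time within the same source video.
# The clip representation, lengths, and helper names are illustrative
# assumptions, not the authors' actual data pipeline.

def sample_pretext_pair(video_frames, audio, fps, sr, clip_sec=2.0, min_gap_sec=1.0):
    """Return (input_clip, conditional_clip) drawn from one source video."""
    n_sec = len(video_frames) / fps
    assert n_sec >= 2 * clip_sec + min_gap_sec, "source video too short"
    clip_len_f = int(clip_sec * fps)   # clip length in video frames
    clip_len_a = int(clip_sec * sr)    # clip length in audio samples

    def cut(start_sec):
        f0 = int(start_sec * fps)
        a0 = int(start_sec * sr)
        return {
            "frames": video_frames[f0:f0 + clip_len_f],
            "audio": audio[a0:a0 + clip_len_a],
        }

    # Pick two start times whose clips do not overlap, so the conditional
    # clip comes from another moment in the same video.
    while True:
        t_in = random.uniform(0, n_sec - clip_sec)
        t_cond = random.uniform(0, n_sec - clip_sec)
        if abs(t_in - t_cond) >= clip_sec + min_gap_sec:
            break

    input_clip = cut(t_in)    # model sees only its frames; its audio is the target
    cond_clip = cut(t_cond)   # model sees both frames and audio as the example
    return input_clip, cond_clip
```

At training time the model sees the input clip's frames together with the conditional clip's frames and audio, and is supervised to reconstruct the input clip's audio; at inference the conditional clip is simply replaced by the user-supplied example of what the video should "sound like".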
Related papers
- Video-Guided Foley Sound Generation with Multimodal Controls [30.515964061350395]
MultiFoley is a model designed for video-guided sound generation.
It supports multimodal conditioning through text, audio, and video.
A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional sound-effects recordings.
arXiv Detail & Related papers (2024-11-26T18:59:58Z)
- Self-Supervised Audio-Visual Soundscape Stylization [22.734359700809126]
We manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene.
Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures.
We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
arXiv Detail & Related papers (2024-09-22T06:57:33Z)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis [9.118448725265669]
One of the most time-consuming steps when designing sound is synchronizing audio with video.
In video games and animations, no reference audio exists, requiring manual annotation of event timings from the video.
We propose a system to extract repetitive action onsets from a video, which are then used to condition a diffusion model trained to generate a new, synchronized sound-effects audio track (a rough sketch of this onset-conditioning idea appears after this list).
arXiv Detail & Related papers (2023-10-23T18:01:36Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds produced outside of the camera's view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
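As a rough companion to the SyncFusion entry above, the sketch below illustrates the general onset-conditioning idea: detect action onsets in a silent video and turn them into a sparse synchronization track that a generative audio model can be conditioned on. The frame-difference peak picker is only a crude stand-in for that paper's learned onset model, the generator itself is omitted, and every name and threshold is an illustrative assumption.

```python
import numpy as np

# Rough sketch of onset-based conditioning: find moments of sudden motion in
# the (silent) video and encode them as an impulse track. The actual system
# summarized above learns onsets and feeds them to a diffusion model; here a
# simple frame-difference peak picker stands in for that detector.

def detect_action_onsets(frames, fps, threshold=2.0):
    """Return onset times (seconds) where inter-frame motion spikes."""
    frames = np.asarray(frames, dtype=np.float32)              # (T, H, W) grayscale
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # per-frame motion energy
    z = (motion - motion.mean()) / (motion.std() + 1e-8)        # normalize
    onset_idx = [
        i for i in range(1, len(z) - 1)
        if z[i] > threshold and z[i] >= z[i - 1] and z[i] >= z[i + 1]
    ]
    return [(i + 1) / fps for i in onset_idx]

def onset_envelope(onset_times, duration_sec, sr=16000):
    """Sparse conditioning track: a unit impulse at each detected onset."""
    env = np.zeros(int(duration_sec * sr), dtype=np.float32)
    for t in onset_times:
        env[min(int(t * sr), len(env) - 1)] = 1.0
    return env
```

In a full pipeline the resulting envelope would be passed, alongside any other conditioning, to the generative model that synthesizes the synchronized sound-effects track.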
This list is automatically generated from the titles and abstracts of the papers in this site.