Conditional Generation of Audio from Video via Foley Analogies
- URL: http://arxiv.org/abs/2304.08490v1
- Date: Mon, 17 Apr 2023 17:59:45 GMT
- Title: Conditional Generation of Audio from Video via Foley Analogies
- Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell and Andrew Owens
- Abstract summary: Sound effects that designers add to videos are designed to convey a particular artistic effect and may be quite different from a scene's true sound.
Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, we propose the problem of conditional Foley.
We show through human studies and automated evaluation metrics that our model successfully generates sound from video.
- Score: 19.681437827280757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sound effects that designers add to videos are designed to convey a
particular artistic effect and, thus, may be quite different from a scene's
true sound. Inspired by the challenges of creating a soundtrack for a video
that differs from its true sound, but that nonetheless matches the actions
occurring on screen, we propose the problem of conditional Foley. We present
the following contributions to address this problem. First, we propose a
pretext task for training our model to predict sound for an input video clip
using a conditional audio-visual clip sampled from another time within the same
source video. Second, we propose a model for generating a soundtrack for a
silent input video, given a user-supplied example that specifies what the video
should "sound like". We show through human studies and automated evaluation
metrics that our model successfully generates sound from video, while varying
its output according to the content of a supplied example. Project site:
https://xypb.github.io/CondFoleyGen/
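The pretext task described above boils down to a data-sampling rule: for each training example, cut an input clip and a conditional audio-visual clip from two different moments of the same source video, so the model must transfer the conditional clip's sound character onto the input clip's on-screen actions. The sketch below is a minimal illustration of that sampling under assumed clip lengths and representations; helper names such as sample_pretext_pair are hypothetical and are not taken from the authors' code.

```python
import random

# Minimal sketch of the pretext-task sampling described in the abstract:
# the model predicts audio for an input clip, conditioned on an audio-visual
# clip taken from a *different* time within the same source video.
# The clip representation, lengths, and helper names are illustrative
# assumptions, not the authors' actual data pipeline.

def sample_pretext_pair(video_frames, audio, fps, sr, clip_sec=2.0, min_gap_sec=1.0):
    """Return (input_clip, conditional_clip) drawn from one source video."""
    n_sec = len(video_frames) / fps
    assert n_sec >= 2 * clip_sec + min_gap_sec, "source video too short"
    clip_len_f = int(clip_sec * fps)   # clip length in video frames
    clip_len_a = int(clip_sec * sr)    # clip length in audio samples

    def cut(start_sec):
        f0 = int(start_sec * fps)
        a0 = int(start_sec * sr)
        return {
            "frames": video_frames[f0:f0 + clip_len_f],
            "audio": audio[a0:a0 + clip_len_a],
        }

    # Pick two start times whose clips do not overlap, so the conditional
    # clip comes from another moment in the same video.
    while True:
        t_in = random.uniform(0, n_sec - clip_sec)
        t_cond = random.uniform(0, n_sec - clip_sec)
        if abs(t_in - t_cond) >= clip_sec + min_gap_sec:
            break

    input_clip = cut(t_in)    # model sees only its frames; its audio is the target
    cond_clip = cut(t_cond)   # model sees both frames and audio as the example
    return input_clip, cond_clip
```

At training time the model sees the input clip's frames together with the conditional clip's frames and audio, and is supervised to reconstruct the input clip's audio; at inference the conditional clip is simply replaced by the user-supplied example of what the video should "sound like".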
Related papers
- Video-Guided Foley Sound Generation with Multimodal Controls [30.515964061350395]
MultiFoley is a model designed for video-guided sound generation.
It supports multimodal conditioning through text, audio, and video.
A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional sound-effects recordings.
arXiv Detail & Related papers (2024-11-26T18:59:58Z)
- Self-Supervised Audio-Visual Soundscape Stylization [22.734359700809126]
We manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene.
Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures.
We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
arXiv Detail & Related papers (2024-09-22T06:57:33Z)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis [9.118448725265669]
One of the most time-consuming steps when designing sound is synchronizing audio with video.
In video games and animations, no reference audio exists, requiring manual annotation of event timings from the video.
We propose a system to extract repetitive action onsets from a video, which are then used to condition a diffusion model trained to generate a new, synchronized sound-effects audio track (a rough sketch of this onset-conditioning idea appears after this list).
arXiv Detail & Related papers (2023-10-23T18:01:36Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds produced outside of the camera's view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
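As a rough companion to the SyncFusion entry above, the sketch below illustrates the general onset-conditioning idea: detect action onsets in a silent video and turn them into a sparse synchronization track that a generative audio model can be conditioned on. The frame-difference peak picker is only a crude stand-in for that paper's learned onset model, the generator itself is omitted, and every name and threshold is an illustrative assumption.

```python
import numpy as np

# Rough sketch of onset-based conditioning: find moments of sudden motion in
# the (silent) video and encode them as an impulse track. The actual system
# summarized above learns onsets and feeds them to a diffusion model; here a
# simple frame-difference peak picker stands in for that detector.

def detect_action_onsets(frames, fps, threshold=2.0):
    """Return onset times (seconds) where inter-frame motion spikes."""
    frames = np.asarray(frames, dtype=np.float32)              # (T, H, W) grayscale
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # per-frame motion energy
    z = (motion - motion.mean()) / (motion.std() + 1e-8)        # normalize
    onset_idx = [
        i for i in range(1, len(z) - 1)
        if z[i] > threshold and z[i] >= z[i - 1] and z[i] >= z[i + 1]
    ]
    return [(i + 1) / fps for i in onset_idx]

def onset_envelope(onset_times, duration_sec, sr=16000):
    """Sparse conditioning track: a unit impulse at each detected onset."""
    env = np.zeros(int(duration_sec * sr), dtype=np.float32)
    for t in onset_times:
        env[min(int(t * sr), len(env) - 1)] = 1.0
    return env
```

In a full pipeline the resulting envelope would be passed, alongside any other conditioning, to the generative model that synthesizes the synchronized sound-effects track.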
This list is automatically generated from the titles and abstracts of the papers in this site.