AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent
Videos with Deep Learning
- URL: http://arxiv.org/abs/2002.10981v1
- Date: Fri, 21 Feb 2020 09:08:28 GMT
- Title: AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent
Videos with Deep Learning
- Authors: Sanchita Ghose, John J. Prevost
- Abstract summary: We present AutoFoley, a fully-automated deep learning tool that can be used to synthesize a representative audio track for videos.
AutoFoley can be used in applications where there is either no corresponding audio file associated with the video or where critical scenarios need to be identified and reinforced with a synthesized soundtrack.
Our experiments show that the synthesized sounds are realistically portrayed and accurately time-synchronized with the associated visual inputs.
- Score: 5.33024001730262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In movie productions, the Foley Artist is responsible for creating an overlay
soundtrack that helps the movie come alive for the audience. This requires the
artist to first identify the sounds that will enhance the experience for the
listener, thereby reinforcing the Director's intention for a given scene. In
this paper, we present AutoFoley, a fully-automated deep learning tool that can
be used to synthesize a representative audio track for videos. AutoFoley can be
used in applications where there is either no corresponding audio file
associated with the video or in cases where there is a need to identify
critical scenarios and provide a synthesized, reinforced soundtrack. An
important performance criterion of the synthesized soundtrack is to be
time-synchronized with the input video, which provides for a realistic and
believable portrayal of the synthesized sound. Unlike existing sound prediction
and generation architectures, our algorithm is capable of precise recognition
of actions as well as inter-frame relations in fast moving video clips by
incorporating an interpolation technique and Temporal Relationship Networks
(TRN). We employ a robust multi-scale Recurrent Neural Network (RNN) associated
with a Convolutional Neural Network (CNN) for a better understanding of the
intricate input-to-output associations over time. To evaluate AutoFoley, we
create and introduce a large scale audio-video dataset containing a variety of
sounds frequently used as Foley effects in movies. Our experiments show that
the synthesized sounds are realistically portrayed with accurate temporal
synchronization with the associated visual inputs. Human qualitative testing of
AutoFoley shows that over 73% of the test subjects considered the generated
soundtrack to be original, which is a noteworthy improvement in cross-modal
research in sound synthesis.
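As a rough illustration of the architecture sketched in the abstract, the snippet below shows how a CNN frame encoder can feed a multi-scale RNN that predicts an audio representation (here a mel-spectrogram frame) per video frame. This is a minimal sketch assuming PyTorch; the class names, layer sizes, and spectrogram output are illustrative assumptions, and the frame-interpolation step and Temporal Relationship Network described in the paper are omitted. It is not the authors' released implementation.

```python
# Minimal, hypothetical sketch of a CNN + multi-scale RNN video-to-sound model.
# All names and sizes are illustrative assumptions, not the AutoFoley code.
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """CNN that maps each video frame to a fixed-size feature vector."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) -> features: (batch, time, feat_dim)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1)
        return self.proj(x).view(b, t, -1)


class MultiScaleSoundRNN(nn.Module):
    """Two GRUs over the frame features at different temporal strides,
    predicting one mel-spectrogram frame per video frame."""

    def __init__(self, feat_dim: int = 256, hidden: int = 512, n_mels: int = 80):
        super().__init__()
        self.rnn_fine = nn.GRU(feat_dim, hidden, batch_first=True)
        self.rnn_coarse = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_mels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        fine, _ = self.rnn_fine(feats)                 # every frame
        coarse, _ = self.rnn_coarse(feats[:, ::2])     # every other frame
        coarse = coarse.repeat_interleave(2, dim=1)[:, : feats.size(1)]
        return self.head(torch.cat([fine, coarse], dim=-1))


if __name__ == "__main__":
    clip = torch.randn(2, 16, 3, 112, 112)             # dummy 16-frame clips
    mel = MultiScaleSoundRNN()(FrameEncoder()(clip))
    print(mel.shape)                                   # torch.Size([2, 16, 80])
```

The two GRUs running at different frame strides stand in for the multi-scale temporal modelling the abstract refers to; in a full pipeline, the predicted spectrogram would still need to be inverted to a waveform and aligned with the video timeline.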
Related papers
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts (a minimal RMS sketch appears after this list).
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- An Initial Exploration: Learning to Generate Realistic Audio for Silent Video [0.0]
We develop a framework that observes video in its natural sequence and generates realistic audio to accompany it.
Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs.
We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively.
arXiv Detail & Related papers (2023-08-23T20:08:56Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
- Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework that is trained to learn a per-frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
arXiv Detail & Related papers (2020-07-23T16:57:44Z)
- Audeo: Audio Generation for a Silent Performance Video [17.705770346082023]
We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video.
Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events.
arXiv Detail & Related papers (2020-06-23T00:58:59Z)
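The Video-Foley entry above uses frame-level Root Mean Square (RMS) energy as a temporal event condition. As a minimal, hypothetical sketch (NumPy assumed; frame_length and hop_length are illustrative parameters, not values from that paper), an RMS envelope over a mono waveform can be computed as:

```python
import numpy as np


def rms_envelope(waveform: np.ndarray,
                 frame_length: int = 1024,
                 hop_length: int = 512) -> np.ndarray:
    """Frame-level root-mean-square energy of a mono waveform."""
    n_frames = 1 + max(0, len(waveform) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_length : i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms


# Example: one second of noise at 16 kHz yields 30 RMS values with these settings.
print(rms_envelope(np.random.randn(16000)).shape)  # (30,)
```

Each value summarizes the loudness of one analysis window, giving a coarse temporal event curve that can be aligned with video frames.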
This list is automatically generated from the titles and abstracts of the papers on this site.