FoleyGAN: Visually Guided Generative Adversarial Network-Based
Synchronous Sound Generation in Silent Videos
- URL: http://arxiv.org/abs/2107.09262v1
- Date: Tue, 20 Jul 2021 04:59:26 GMT
- Title: FoleyGAN: Visually Guided Generative Adversarial Network-Based
Synchronous Sound Generation in Silent Videos
- Authors: Sanchita Ghose and John J. Prevost
- Abstract summary: We introduce a novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation.
Our proposed FoleyGAN model conditions on the action sequences of visual events to generate realistic, visually aligned sound tracks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning based visual-to-sound generation systems must be designed with
particular attention to keeping visual and audio features synchronized over time. In this
research we introduce a novel task: guiding a class-conditioned generative adversarial
network with the temporal visual information of a video input for visual-to-sound
generation, so that the synchronicity between the audio and visual modalities is
preserved. Our proposed FoleyGAN model conditions on the action sequences of visual
events to generate realistic, visually aligned sound tracks. We expand our previously
proposed Automatic Foley dataset to train FoleyGAN, and a human survey of the
synthesized sound shows noteworthy audio-visual synchronicity (81% on average). Our
approach also outperforms baseline models in statistical experiments across
audio-visual datasets.
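As a rough illustration of the class-conditioned, visually guided generation described in the abstract, the sketch below feeds a noise vector, a class embedding, and temporally pooled per-frame visual features into a generator that outputs a spectrogram-like tensor. Every module name, dimension, and wiring choice here is an assumption made for illustration; this is not the FoleyGAN implementation.

```python
# Minimal sketch (assumed architecture, not the authors' code): a generator that
# fuses a noise vector, a class embedding, and temporal visual features pooled
# over video frames, then decodes them into a mel-spectrogram-like output.
import torch
import torch.nn as nn

class VisuallyGuidedGenerator(nn.Module):
    def __init__(self, num_classes=12, noise_dim=128, visual_dim=512,
                 n_mels=80, n_frames=256):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, 64)
        # GRU summarizes per-frame visual features into one temporal context vector.
        self.temporal = nn.GRU(visual_dim, 256, batch_first=True)
        self.decode = nn.Sequential(
            nn.Linear(noise_dim + 64 + 256, 1024), nn.ReLU(),
            nn.Linear(1024, n_mels * n_frames), nn.Tanh(),
        )
        self.n_mels, self.n_frames = n_mels, n_frames

    def forward(self, noise, class_ids, visual_feats):
        # visual_feats: (batch, time, visual_dim) per-frame CNN features of the video
        _, h = self.temporal(visual_feats)          # h: (1, batch, 256)
        cond = torch.cat([noise, self.class_emb(class_ids), h.squeeze(0)], dim=1)
        spec = self.decode(cond)
        return spec.view(-1, 1, self.n_mels, self.n_frames)

# Example forward pass with random inputs.
gen = VisuallyGuidedGenerator()
noise = torch.randn(4, 128)
labels = torch.randint(0, 12, (4,))
frames = torch.randn(4, 30, 512)                    # 30 frames of pooled visual features
fake_spec = gen(noise, labels, frames)              # (4, 1, 80, 256)
```

In a full adversarial setup, a discriminator conditioned on the same class label and visual context would score real versus generated spectrograms; that half is omitted to keep the sketch short.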
Related papers
- Video-to-Audio Generation with Hidden Alignment [27.11625918406991]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques.
We demonstrate our model exhibits state-of-the-art video-to-audio generation capabilities.
arXiv Detail & Related papers (2024-07-10T08:40:39Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces [28.245662058349854]
In this paper, we explore if audio-guided coarse head-pose can further enhance visual attention estimation performance for non-prolific faces.
We use off-the-shelf state-of-the-art models to facilitate cross-modal weak-supervision.
Our model can utilize any of the available modalities for task-specific inference.
arXiv Detail & Related papers (2022-07-07T02:23:02Z)
- Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity [12.995632804090198]
CrissCross is a self-supervised framework for learning audio-visual representations.
We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations.
arXiv Detail & Related papers (2021-11-09T20:24:19Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios (see the HRIR rendering sketch after this list).
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
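The PseudoBinaural entry above mentions head-related impulse responses (HRIRs); the sketch below illustrates the general HRIR-rendering idea it relies on, not that paper's recording-free pipeline. The HRIR taps here are random placeholders, and a real system would look up a measured pair for the desired source direction.

```python
# Minimal sketch of HRIR-based binaural rendering (an assumed illustration of the
# general technique, not the PseudoBinaural pipeline): a mono signal convolved
# with a left/right head-related impulse response pair yields a two-channel
# signal that appears to come from the HRIR's direction.
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono waveform with an HRIR pair for one source direction."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((2, n))
    out[0, :len(left)] = left
    out[1, :len(right)] = right
    return out

# Placeholder inputs: a noise burst and made-up 128-tap HRIRs.
mono = np.random.randn(16000)          # 1 s of audio at 16 kHz
hrir_l = np.random.randn(128) * 0.1
hrir_r = np.random.randn(128) * 0.05   # quieter right ear ~ source toward the left
binaural = render_binaural(mono, hrir_l, hrir_r)   # shape (2, 16127)
```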