Collaborative Learning to Generate Audio-Video Jointly
- URL: http://arxiv.org/abs/2104.02656v1
- Date: Thu, 1 Apr 2021 01:00:51 GMT
- Title: Collaborative Learning to Generate Audio-Video Jointly
- Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P
Namboodiri, Preethi Jyothi
- Abstract summary: We propose a method to generate naturalistic samples of video and audio data by the joint correlated generation of audio and video modalities.
The proposed method uses multiple discriminators to ensure that the audio, the video, and the joint output are each indistinguishable from real-world samples.
- Score: 39.193054126350496
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: There have been a number of techniques that have demonstrated the generation
of multimedia data for one modality at a time using GANs, such as the ability
to generate images, videos, and audio. However, the task of multi-modal data
generation, specifically for both audio and video, has not yet been
sufficiently explored. Towards this, we propose a method that demonstrates
that we are able to generate naturalistic samples of video and audio data by
the joint correlated generation of audio and video modalities. The proposed
method uses multiple discriminators to ensure that the audio, the video, and
the joint output are each indistinguishable from real-world samples. We present a
dataset for this task and show that we are able to generate realistic samples.
The method is validated using standard metrics such as Inception Score and
Fréchet Inception Distance (FID), as well as through human evaluation.
Related papers
- Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment [18.08290178587821]
We propose a method for generating images of visual scenes from diverse in-the-wild sounds.
This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals.
arXiv Detail & Related papers (2024-12-09T05:04:50Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting
Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Audio-Driven Dubbing for User Generated Contents via Style-Aware
Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z) - AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.