Sound2Sight: Generating Visual Dynamics from Sound and Context
- URL: http://arxiv.org/abs/2007.12130v1
- Date: Thu, 23 Jul 2020 16:57:44 GMT
- Title: Sound2Sight: Generating Visual Dynamics from Sound and Context
- Authors: Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja
- Abstract summary: We present Sound2Sight, a deep variational framework that is trained to learn a per-frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
- Score: 36.38300120482868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning associations across modalities is critical for robust multimodal
reasoning, especially when a modality may be missing during inference. In this
paper, we study this problem in the context of audio-conditioned visual
synthesis -- a task that is important, for example, in occlusion reasoning.
Specifically, our goal is to generate future video frames and their motion
dynamics conditioned on audio and a few past frames. To tackle this problem, we
present Sound2Sight, a deep variational framework that is trained to learn a
per-frame stochastic prior conditioned on a joint embedding of audio and past
frames. This embedding is learned via a multi-head attention-based audio-visual
transformer encoder. The learned prior is then sampled to further condition a
video forecasting module to generate future frames. The stochastic prior allows
the model to sample multiple plausible futures that are consistent with the
provided audio and the past context. Moreover, to improve the quality and
coherence of the generated frames, we propose a multimodal discriminator that
differentiates between a synthesized and a real audio-visual clip. We
empirically evaluate our approach, vis-à-vis closely related prior methods,
on two new datasets, viz. (i) Multimodal Stochastic Moving MNIST with a Surprise
Obstacle and (ii) YouTube Paintings, as well as on the existing Audio-Set Drums
dataset. Our extensive experiments demonstrate that Sound2Sight significantly
outperforms the state of the art in the generated video quality, while also
producing diverse video content.
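The pipeline described in the abstract (an attention-based audio-visual encoder whose joint embedding parameterises a per-frame stochastic prior, which is then sampled to condition frame generation) can be illustrated with a minimal sketch. This is not the authors' implementation: all dimensions, weight matrices, and the Gaussian prior head below are hypothetical, and plain NumPy stands in for a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 4   # embedding size and number of attention heads (illustrative)
Dh = D // H    # per-head dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, context, Wq, Wk, Wv, Wo):
    """query: (Tq, D) past-frame tokens; context: (Tc, D) audio tokens."""
    Tq, Tc = query.shape[0], context.shape[0]
    # Project and split into heads: (H, T, Dh)
    q = (query @ Wq).reshape(Tq, H, Dh).transpose(1, 0, 2)
    k = (context @ Wk).reshape(Tc, H, Dh).transpose(1, 0, 2)
    v = (context @ Wv).reshape(Tc, H, Dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(Dh))  # (H, Tq, Tc)
    out = (attn @ v).transpose(1, 0, 2).reshape(Tq, D)      # concat heads
    return out @ Wo  # joint audio-visual embedding, (Tq, D)

# Hypothetical inputs: 5 past-frame embeddings attend over 20 audio tokens.
frames = rng.standard_normal((5, D))
audio = rng.standard_normal((20, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
joint = multi_head_attention(frames, audio, Wq, Wk, Wv, Wo)

# A per-frame stochastic prior can then be parameterised from the joint
# embedding, e.g. via mean and log-variance heads plus Gaussian sampling.
Wmu = rng.standard_normal((D, D)) * 0.1
Wlv = rng.standard_normal((D, D)) * 0.1
mu, logvar = joint @ Wmu, joint @ Wlv
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # (5, D)
```

Sampling `z` repeatedly yields the multiple plausible futures mentioned above; in the actual model, such a sample would condition the video forecasting module that generates the next frame.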
Related papers
- Video-to-Audio Generation with Hidden Alignment [28.284162057261565]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques.
We demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities.
arXiv Detail & Related papers (2024-07-10T08:40:39Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [47.076283429992664]
V2Meow is a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types.
It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames.
arXiv Detail & Related papers (2023-05-11T06:26:41Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the diminishing global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) is a framework that exploits visual, text, and audio features to score a video with respect to its frames.
MFST framework first extracts each modality features (visual-text-audio) using pretrained encoders.
MFST trains the multimodal frame-scoring transformer that uses video-text-audio representations as inputs and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds generated outside of the camera's view cannot be inferred from video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.