Sound-Guided Semantic Video Generation
- URL: http://arxiv.org/abs/2204.09273v2
- Date: Thu, 21 Apr 2022 02:13:31 GMT
- Title: Sound-Guided Semantic Video Generation
- Authors: Seung Hyun Lee, Gyeongrok Oh, Wonmin Byeon, Jihyun Bae, Chanyoung Kim,
Won Jeong Ryoo, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim
- Abstract summary: We propose a framework to generate realistic videos by leveraging a multimodal (sound-image-text) embedding space.
As sound provides the temporal context of the scene, our framework learns to generate a video that is semantically consistent with the sound.
- Score: 15.225598817462478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent success of StyleGAN demonstrates that a pre-trained StyleGAN latent
space is useful for realistic video generation. However, the motion in the generated
video is usually not semantically meaningful because it is difficult to determine the
direction and magnitude of movement in the StyleGAN latent space. In this paper, we
propose a framework that generates realistic videos by leveraging a multimodal
(sound-image-text) embedding space. As sound provides the temporal context of the
scene, our framework learns to generate a video that is semantically consistent with
the sound. First, our sound inversion module maps the audio directly into the StyleGAN
latent space. We then incorporate the CLIP-based multimodal embedding space to further
capture audio-visual relationships. Finally, the proposed frame generator learns to
find a trajectory in the latent space that is coherent with the corresponding sound and
generates the video in a hierarchical manner. We provide a new high-resolution
landscape video dataset (audio-visual pairs) for the sound-guided video generation
task. Experiments show that our model outperforms state-of-the-art methods in terms of
video quality. We further demonstrate several applications, including image and video
editing, to verify the effectiveness of our method.
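To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of two ideas from the abstract: a sound inversion module that maps audio features to a per-frame latent trajectory, and a CLIP-style consistency loss that pulls audio and frame embeddings together. Module names, dimensions, and the loss form are illustrative assumptions, not the authors' released code.

```python
# Minimal conceptual sketch, NOT the authors' implementation.
# Module names, dimensions, and the loss below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 512   # assumed StyleGAN w-space dimensionality
AUDIO_DIM = 128    # assumed per-frame audio (e.g. mel-spectrogram) feature size
NUM_FRAMES = 16    # length of the latent trajectory / generated video


class SoundInversion(nn.Module):
    """Maps an audio feature sequence to a sequence of StyleGAN-style latent codes."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(AUDIO_DIM, LATENT_DIM, batch_first=True)
        self.proj = nn.Linear(LATENT_DIM, LATENT_DIM)

    def forward(self, audio):            # audio: (B, T, AUDIO_DIM)
        h, _ = self.rnn(audio)           # temporal context carried by the sound
        return self.proj(h)              # (B, T, LATENT_DIM) latent trajectory


def multimodal_consistency_loss(audio_emb, frame_emb):
    """Cosine-similarity loss pulling audio and frame embeddings together,
    in the spirit of a CLIP-like joint embedding space (illustrative only)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    frame_emb = F.normalize(frame_emb, dim=-1)
    return 1.0 - (audio_emb * frame_emb).sum(dim=-1).mean()


if __name__ == "__main__":
    inverter = SoundInversion()
    audio = torch.randn(2, NUM_FRAMES, AUDIO_DIM)       # dummy audio features
    latents = inverter(audio)                            # per-frame latent codes
    # A pretrained StyleGAN generator (omitted here) would decode each latent
    # code into a frame; the trajectory through latent space yields the video.
    frame_emb = torch.randn(2, NUM_FRAMES, LATENT_DIM)   # placeholder frame embeddings
    print(latents.shape, multimodal_consistency_loss(latents, frame_emb).item())
```

In the full framework, a pretrained StyleGAN generator would decode each latent code into a frame and the frame generator would refine the trajectory hierarchically; this sketch only illustrates the audio-to-latent mapping and a multimodal consistency term.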
Related papers
- TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation [4.019144083959918]
We present TANGO, a framework for generating co-speech body-gesture videos.
Given a few-minute, single-speaker reference video, TANGO produces high-fidelity videos with synchronized body gestures.
arXiv Detail & Related papers (2024-10-05T16:30:46Z)
- Context-aware Talking Face Video Generation [30.49058027339904]
We consider a novel and practical case for talking face video generation.
We take facial landmarks as a control signal to bridge the driving audio, talking context and generated videos.
The experimental results verify the advantage of the proposed method over other baselines in terms of audio-video synchronization, video fidelity and frame consistency.
arXiv Detail & Related papers (2024-02-28T06:25:50Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Audio-driven Neural Gesture Reenactment with Video Motion Graphs [30.449816206864632]
We present a method that reenacts a high-quality video with gestures matching a target speech audio.
The key idea of our method is to split and re-assemble clips from a reference video through a novel video motion graph encoding valid transitions between clips.
To seamlessly connect different clips in the reenactment, we propose a pose-aware video blending network which synthesizes video frames around the stitched frames between two clips.
arXiv Detail & Related papers (2022-07-23T14:02:57Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.