KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
- URL: http://arxiv.org/abs/2504.09656v1
- Date: Sun, 13 Apr 2025 17:06:03 GMT
- Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
- Authors: Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
- Abstract summary: KeyVID is a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals. We demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets.
- Score: 28.859027881497376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when the number of frames is increased directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released at https://github.com/XingruiWang/KeyVID.
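The three-stage pipeline described in the abstract (localize keyframe time steps from the audio, generate the corresponding visual keyframes, then interpolate all intermediate frames) can be summarized with a minimal sketch. All function and class names below are hypothetical placeholders, and the energy-based localizer and linear blending merely stand in for the learned diffusion components; see the released code at https://github.com/XingruiWang/KeyVID for the actual implementation.

```python
# Minimal sketch of a KeyVID-like three-stage pipeline (assumed interfaces).
import numpy as np


def localize_keyframes(audio: np.ndarray, total_frames: int, num_key: int) -> list[int]:
    """Assumption: per-frame audio energy as a stand-in for the learned
    keyframe localizer; returns the indices of the most energetic frames."""
    hop = max(1, len(audio) // total_frames)
    energy = np.array([np.abs(audio[i * hop:(i + 1) * hop]).mean()
                       for i in range(total_frames)])
    return np.sort(np.argsort(energy)[-num_key:]).tolist()


class KeyframeGenerator:
    """Placeholder for the diffusion-based keyframe generator."""

    def generate(self, image: np.ndarray, audio: np.ndarray,
                 key_idx: list[int]) -> dict[int, np.ndarray]:
        # Here we simply repeat the conditioning image; the real model would
        # synthesize distinct keyframes conditioned on the audio.
        return {i: image.copy() for i in key_idx}


class MotionInterpolator:
    """Placeholder for the motion interpolator that fills in-between frames."""

    def interpolate(self, keyframes: dict[int, np.ndarray],
                    total_frames: int) -> list[np.ndarray]:
        idx = sorted(keyframes)
        frames = []
        for t in range(total_frames):
            # Linear blend between the two nearest keyframes (assumption).
            lo = max([i for i in idx if i <= t], default=idx[0])
            hi = min([i for i in idx if i >= t], default=idx[-1])
            w = 0.0 if hi == lo else (t - lo) / (hi - lo)
            frames.append((1 - w) * keyframes[lo] + w * keyframes[hi])
        return frames


def keyvid_like_pipeline(image, audio, total_frames=48, num_key=12):
    key_idx = localize_keyframes(audio, total_frames, num_key)
    keyframes = KeyframeGenerator().generate(image, audio, key_idx)
    return MotionInterpolator().interpolate(keyframes, total_frames)
```

For instance, generating 12 keyframes for a 48-frame clip and filling in the remaining frames with the interpolator illustrates how keyframe-aware sampling concentrates generation effort on key moments instead of densely synthesizing every frame.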
Related papers
- Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation [62.218932509432314]
Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames. We learn audio-visual correlations and integrate them to enhance feature representation and regularize the final generation.
arXiv Detail & Related papers (2025-04-08T07:23:28Z) - Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos [6.062921267681344]
Deep facial animation techniques efficiently compress talking head videos by applying deep generative models. In this paper, we propose a novel learned animation codec that generates natural facial videos using past and future frames.
arXiv Detail & Related papers (2025-03-12T19:39:09Z) - Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z) - Generative Inbetweening through Frame-wise Conditions-Driven Video Generation [63.43583844248389]
Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. We propose a Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear curves.
arXiv Detail & Related papers (2024-12-16T13:19:41Z) - Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior [13.595032265551184]
Video-to-video synthesis poses significant challenges in maintaining character consistency, ensuring smooth temporal transitions, and preserving visual quality during fast motion.
We propose an adaptive motion-guided cross-frame attention mechanism that selectively reduces redundant computations.
This enables a greater number of cross-frame attentions over more frames within the same computational budget.
arXiv Detail & Related papers (2024-06-07T12:12:25Z) - Predictive Coding For Animation-Based Video Compression [13.161311799049978]
We propose a predictive coding scheme which uses image animation as a predictor, and codes the residual with respect to the actual target frame.
Our experiments indicate a significant gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC.
arXiv Detail & Related papers (2023-07-09T14:40:54Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - E-VFIA : Event-Based Video Frame Interpolation with Attention [8.93294761619288]
We propose Event-based Video Frame Interpolation with Attention (E-VFIA), a lightweight kernel-based method.
E-VFIA fuses event information with standard video frames by deformable convolutions to generate high quality interpolated frames.
The proposed method represents events with high temporal resolution and uses a multi-head self-attention mechanism to better encode event-based information.
arXiv Detail & Related papers (2022-09-19T21:40:32Z) - Video Frame Interpolation without Temporal Priors [91.04877640089053]
Video frame interpolation aims to synthesize non-existent intermediate frames in a video sequence.
The temporal priors of videos, i.e. frames per second (FPS) and frame exposure time, may vary across different camera sensors.
We devise a novel optical flow refinement strategy for better synthesis results.
arXiv Detail & Related papers (2021-12-02T12:13:56Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)