Related papers: Anchored Diffusion for Video Face Reenactment

URL: http://arxiv.org/abs/2407.15153v1
Date: Sun, 21 Jul 2024 13:14:17 GMT
Title: Anchored Diffusion for Video Face Reenactment
Authors: Idan Kligvasser, Regev Cohen, George Leifman, Ehud Rivlin, Michael Elad
Abstract summary: We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance. During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
Score: 17.343307538702238
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer consistent high-quality videos while offering editing capabilities.

Related papers

Inference-based GAN Video Generation [47.53991869205973]
We propose a new type of video generator by enabling adversarial-based unconditional video generators with a variational encoder. Existing models struggle with the temporal scaling of the generated videos. We employ a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames.
arXiv Detail & Related papers (2025-12-25T20:14:38Z)
VideoMerge: Towards Training-free Long Video Generation [46.108622251662176]
Long video generation remains a challenging and compelling topic in computer vision. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos.
arXiv Detail & Related papers (2025-03-13T00:47:59Z)
HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation [39.69554411714128]
We propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a dataset containing 14,000 hours of high-quality video. HumanDiT supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation. Experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
arXiv Detail & Related papers (2025-02-07T11:36:36Z)
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation. We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z)
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering. We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling [19.004339956475498]
MAVIN is designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence. We introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics. Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions.
arXiv Detail & Related papers (2024-05-28T09:46:09Z)
Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once. This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution. By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos. To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process. The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.