LatentMan: Generating Consistent Animated Characters using Image Diffusion Models
- URL: http://arxiv.org/abs/2312.07133v2
- Date: Sun, 2 Jun 2024 10:50:54 GMT
- Title: LatentMan: Generating Consistent Animated Characters using Image Diffusion Models
- Authors: Abdelrahman Eldesokey, Peter Wonka
- Abstract summary: We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models.
Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.
- Score: 44.18315132571804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap, and we introduce LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions to guide the T2I model. To boost the temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference. Project page https://abdo-eldesokey.github.io/latentman/.
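The abstract's Spatial Latent Alignment idea can be illustrated with a minimal sketch: warp the previous frame's latent through a precomputed dense correspondence map and blend it with the current frame's latent. The function name `align_latents`, the array shapes, and the simple linear blending rule are all assumptions for illustration; the paper's actual module operates inside the diffusion sampling loop and the exact formulation differs.

```python
import numpy as np

def align_latents(latent_prev, latent_cur, correspondence, alpha=0.5):
    """Blend the current frame's latent with the previous frame's latent,
    warped through a per-pixel dense correspondence map.

    latent_prev, latent_cur: (C, H, W) latent arrays.
    correspondence: (H, W, 2) integer map; correspondence[y, x] gives the
        matching (y', x') location in the previous frame for pixel (y, x).
    alpha: blending weight toward the warped previous-frame latent.
    """
    ys, xs = correspondence[..., 0], correspondence[..., 1]
    warped = latent_prev[:, ys, xs]  # gather -> (C, H, W)
    return (1 - alpha) * latent_cur + alpha * warped
```

With an identity correspondence this reduces to a plain average of the two latents, which makes the blending behavior easy to check in isolation.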
Related papers
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of self-attention calculation, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
- LoopAnimate: Loopable Salient Object Animation [19.761865029125524]
LoopAnimate is a novel method for generating videos with consistent start and end frames.
It achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
arXiv Detail & Related papers (2024-04-14T07:36:18Z)
- Pix2Gif: Motion-Guided Diffusion for GIF Generation [70.64240654310754]
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation.
We propose a new motion-guided warping module to spatially transform the features of the source image conditioned on the two types of prompts.
In preparation for the model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset.
arXiv Detail & Related papers (2024-03-07T16:18:28Z)
- AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
arXiv Detail & Related papers (2023-12-06T13:39:35Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
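The VMC summary above mentions a motion distillation objective built on residual vectors between consecutive frames. A minimal sketch of that idea, under assumed array shapes and a plain squared-error comparison (the paper works on latent residuals and its exact objective differs), could look like:

```python
import numpy as np

def motion_distillation_loss(frames_pred, frames_ref):
    """Compare motion via residuals between consecutive frames rather than
    raw pixels, so a global appearance shift contributes no loss.

    frames_pred, frames_ref: (T, H, W, C) frame stacks.
    """
    res_pred = frames_pred[1:] - frames_pred[:-1]  # (T-1, H, W, C)
    res_ref = frames_ref[1:] - frames_ref[:-1]
    return float(np.mean((res_pred - res_ref) ** 2))
```

Because only frame-to-frame differences are compared, two videos with identical motion but different overall brightness incur zero loss, while a static video compared against a moving one does not.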
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with only 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.