ID-Animator: Zero-Shot Identity-Preserving Human Video Generation
- URL: http://arxiv.org/abs/2404.15275v3
- Date: Tue, 25 Jun 2024 16:57:27 GMT
- Title: ID-Animator: Zero-Shot Identity-Preserving Human Video Generation
- Authors: Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang
- Abstract summary: ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training.
Our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models.
- Score: 16.438935466843304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or missing identity details in the video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline that incorporates unified human attributes and action captioning techniques from a constructed facial image pool. Based on this pipeline, a random reference training strategy is further devised to precisely capture the ID-relevant embeddings with an ID-preserving loss, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator in generating personalized human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our code and checkpoints are released at https://github.com/ID-Animator/ID-Animator.
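The abstract describes a face adapter that encodes ID-relevant embeddings from learnable facial latent queries and injects them into a frozen diffusion-based T2V backbone. Below is a minimal, illustrative sketch of that idea, not the released implementation: the module names, feature dimensions, and the token-concatenation conditioning scheme are assumptions chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn

class FaceAdapter(nn.Module):
    """Illustrative sketch (assumed, not the official ID-Animator code):
    learnable latent queries cross-attend to reference-face features and
    produce a fixed number of ID-relevant tokens."""

    def __init__(self, face_feat_dim=1024, hidden_dim=768, num_queries=16, num_heads=8):
        super().__init__()
        # Learnable facial latent queries that pool identity information.
        self.latent_queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.proj_in = nn.Linear(face_feat_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, face_features):
        # face_features: (batch, num_patches, face_feat_dim) from a single reference face image
        kv = self.proj_in(face_features)
        q = self.latent_queries.unsqueeze(0).expand(face_features.size(0), -1, -1)
        id_tokens, _ = self.cross_attn(q, kv, kv)   # queries attend to the face features
        return self.proj_out(id_tokens)             # (batch, num_queries, hidden_dim)

# ID tokens could be concatenated with text tokens so the (frozen) T2V backbone's
# cross-attention layers condition on identity as well as on the prompt.
adapter = FaceAdapter()
face_features = torch.randn(1, 257, 1024)   # e.g. ViT-L/14 patch features (assumed)
text_tokens = torch.randn(1, 77, 768)       # e.g. CLIP text-encoder output (assumed)
context = torch.cat([text_tokens, adapter(face_features)], dim=1)
```

During training, the abstract's random reference strategy would feed the adapter a different image of the same identity drawn from the facial image pool, so that the ID-preserving loss rewards identity-relevant features rather than pose or background details.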
Related papers
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z) - VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z) - Magic-Me: Identity-Specific Video Customized Diffusion [72.05925155000165]
We propose a subject-identity-controllable video generation framework, termed Video Custom Diffusion (VCD).
With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation for stable video outputs.
We conducted extensive experiments to verify that VCD is able to generate stable videos with better identity preservation than the baselines.
arXiv Detail & Related papers (2024-02-14T18:13:51Z) - StableIdentity: Inserting Anybody into Anywhere at First Sight [57.99693188913382]
We propose StableIdentity, which allows identity-consistent recontextualization with just one face image.
We are the first to directly inject the identity learned from a single image into video/3D generation without finetuning.
arXiv Detail & Related papers (2024-01-29T09:06:15Z) - Facial Expression Video Generation Based-On Spatio-temporal
Convolutional GAN: FEV-GAN [1.279257604152629]
We present a novel approach for generating videos of the six basic facial expressions.
Our approach is based on Spatio-temporal Convolutional GANs, which are known to model both content and motion in the same network.
The code and the pre-trained model will soon be made publicly available.
arXiv Detail & Related papers (2022-10-20T11:54:32Z) - A Video Is Worth Three Views: Trigeminal Transformers for Video-based
Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z) - PoseTrackReID: Dataset Description [97.7241689753353]
Pose information is helpful to disentangle useful feature information from background or occlusion noise.
With PoseTrackReID, we want to bridge the gap between person re-ID and multi-person pose tracking.
This dataset provides a good benchmark for current state-of-the-art methods on multi-frame person re-ID.
arXiv Detail & Related papers (2020-11-12T07:44:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.