Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control
- URL: http://arxiv.org/abs/2406.03035v3
- Date: Thu, 13 Jun 2024 01:17:18 GMT
- Title: Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control
- Authors: Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang Liu, Wenhan Luo
- Abstract summary: Follow-Your-Pose v2 can be trained on noisy open-sourced videos readily available on the internet.
Our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and 7 metrics.
- Score: 77.08568533331206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pose-controllable character video generation is in high demand, with extensive applications in fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods that use pose sequences and reference images have shown promising performance, they tend to produce incoherent animation in complex scenarios, such as multi-character animation and body occlusion. Additionally, current methods require large-scale, high-quality videos with stable backgrounds and temporal consistency as training data; otherwise, their performance deteriorates greatly. These two issues hinder the practical use of character image animation tools. In this paper, we propose a practical and robust framework, Follow-Your-Pose v2, which can be trained on noisy open-sourced videos readily available on the internet. Multi-condition guiders are designed to address the challenges of background stability, body occlusion in multi-character generation, and consistency of character appearance. Moreover, to fill the gap in fair evaluation of multi-character pose animation, we propose a new benchmark comprising approximately 4,000 frames. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and 7 metrics. Meanwhile, qualitative assessments reveal a significant improvement in the quality of the generated videos, particularly in scenarios involving complex backgrounds and multi-character body occlusion, demonstrating the superiority of our approach.
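The abstract does not detail how the multi-condition guiders are implemented, so below is a minimal, hypothetical PyTorch sketch of the general idea: each condition (pose, depth, and reference appearance are illustrative choices) is encoded by a small convolutional guider whose output is added to the noisy latent before it enters the denoising network. Module names, channel sizes, and the additive fusion rule are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of multi-condition guidance for a latent video diffusion
# model. The set of conditions, module names, and channel sizes are
# illustrative assumptions, not the paper's actual design.
import torch
import torch.nn as nn

class ConditionGuider(nn.Module):
    """Encodes one condition map (e.g. a pose skeleton) into latent-sized features."""
    def __init__(self, in_ch: int, latent_ch: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, latent_ch, 3, padding=1),
        )

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.net(cond)

class MultiConditionGuidance(nn.Module):
    """Sums the guider outputs and injects them into the noisy latent."""
    def __init__(self, cond_channels: dict):
        super().__init__()
        self.guiders = nn.ModuleDict(
            {name: ConditionGuider(ch) for name, ch in cond_channels.items()}
        )

    def forward(self, noisy_latent: torch.Tensor, conds: dict) -> torch.Tensor:
        guidance = sum(self.guiders[name](c) for name, c in conds.items())
        return noisy_latent + guidance  # would then be fed to the denoising UNet

# Usage with dummy tensors (batch of 2, 64x64 latents):
guidance = MultiConditionGuidance({"pose": 3, "depth": 1, "reference": 3})
latent = torch.randn(2, 4, 64, 64)
conds = {
    "pose": torch.randn(2, 3, 64, 64),
    "depth": torch.randn(2, 1, 64, 64),
    "reference": torch.randn(2, 3, 64, 64),
}
out = guidance(latent, conds)  # same shape as the latent: (2, 4, 64, 64)
```

Additive fusion keeps the guiders independent, so individual conditions can be dropped at inference time; the paper may well combine its conditions differently.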
Related papers
- Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation [19.408715783816167]
We introduce a training-free framework that ensures the generated video sequence preserves the reference image's subtleties.
We decouple skeletal and motion priors from pose information, enabling precise control over animation generation.
Our method significantly enhances the quality of video generation without the need for large datasets or expensive computational resources.
arXiv Detail & Related papers (2024-08-29T13:08:12Z)
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
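The unified noise input can be illustrated with a small, hedged sketch: the version below assumes the first-frame-conditioned branch simply places the clean first-frame latent into frame 0 of the noise video and appends a binary mask channel marking conditioned frames. The function name, shapes, and mask convention are illustrative assumptions, not UniAnimate's actual formulation.

```python
# Illustrative sketch only: a "unified" noise input covering both a random-noise
# branch and a first-frame-conditioned branch via an extra mask channel.
from typing import Optional

import torch

def unified_noise_input(num_frames: int, first_latent: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Return a (frames, C+1, H, W) noise video with a conditioning-mask channel."""
    c, h, w = (4, 32, 32) if first_latent is None else first_latent.shape
    noise = torch.randn(num_frames, c, h, w)
    mask = torch.zeros(num_frames, 1, h, w)
    if first_latent is not None:
        noise[0] = first_latent  # keep the clean first-frame latent
        mask[0] = 1.0            # mark frame 0 as conditioned
    return torch.cat([noise, mask], dim=1)

x_random = unified_noise_input(16)                          # purely random noise
x_firstf = unified_noise_input(16, torch.randn(4, 32, 32))  # first-frame conditioned
print(x_random.shape, x_firstf.shape)  # torch.Size([16, 5, 32, 32]) for both
```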
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
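As a rough illustration of how a geometry-aware pose controller could combine the two pose representations, the sketch below assumes plain channel-wise concatenation of a dense SMPL-X rendering map and a sparse skeleton map, followed by a small convolutional encoder; the layer sizes and fusion scheme are placeholders rather than VividPose's published design.

```python
# Placeholder sketch: fuse a dense SMPL-X rendering map with a sparse skeleton
# map by concatenation plus a small conv encoder. Not VividPose's actual layers.
import torch
import torch.nn as nn

class GeometryAwarePoseController(nn.Module):
    def __init__(self, dense_ch: int = 3, sparse_ch: int = 3, out_ch: int = 320):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(dense_ch + sparse_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, dense_map: torch.Tensor, skeleton_map: torch.Tensor) -> torch.Tensor:
        # dense_map: rendered SMPL-X surface (e.g. normals); skeleton_map: drawn keypoints/limbs
        return self.fuse(torch.cat([dense_map, skeleton_map], dim=1))

controller = GeometryAwarePoseController()
pose_feat = controller(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(pose_feat.shape)  # torch.Size([1, 320, 64, 64])
```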
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
- Zero-shot High-fidelity and Pose-controllable Character Animation [89.74818983864832]
Image-to-video (I2V) generation aims to create a video sequence from a single image.
Existing approaches suffer from inconsistency of character appearances and poor preservation of fine details.
We propose PoseAnimate, a novel zero-shot I2V framework for character animation.
arXiv Detail & Related papers (2024-04-21T14:43:31Z)
- CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization [27.55341255800119]
We present CharacterGen, a framework developed to efficiently generate 3D characters.
A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach.
We have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model.
arXiv Detail & Related papers (2024-02-27T05:10:59Z)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model [74.84435399451573]
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity.
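The frame-warping baseline mentioned above can be illustrated with a toy example that resamples the reference image along a dense motion field via grid_sample; the flow here is synthetic, whereas real animation methods estimate it from the target pose or motion.

```python
# Toy illustration of frame warping: resample the reference image at positions
# displaced by a dense flow field. The flow is synthetic for demonstration.
import torch
import torch.nn.functional as F

def warp(reference: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """reference: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (dx, dy)."""
    b, _, h, w = reference.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                                      # absolute sampling positions
    # Normalize to [-1, 1] as grid_sample expects, then reorder to (B, H, W, 2).
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1
    grid = coords.permute(0, 2, 3, 1)
    return F.grid_sample(reference, grid, mode="bilinear", align_corners=True)

ref = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
flow[:, 0] = 5.0               # each output pixel samples 5 px to its right in the reference
warped = warp(ref, flow)
print(warped.shape)            # torch.Size([1, 3, 64, 64])
```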
arXiv Detail & Related papers (2023-11-27T18:32:31Z)
- DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time varying surface details without the need of using pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z)