AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance
- URL: http://arxiv.org/abs/2502.08189v2
- Date: Wed, 21 May 2025 05:46:18 GMT
- Title: AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance
- Authors: Zhao Wang, Hao Wen, Lingting Zhu, Chenming Shang, Yujiu Yang, Qi Dou
- Abstract summary: We propose a novel framework, AnyCharV, that flexibly generates character videos using arbitrary source characters and target scenes. In the first stage, we develop a base model capable of integrating the source character with the target scene using pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism, where we use the generated video in the first stage and replace the fine mask with the coarse one.
- Score: 36.27326882135989
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Character video generation is a significant real-world application focused on producing high-quality videos featuring specific characters. Recent advancements have introduced various control signals to animate static characters, successfully enhancing control over the generation process. However, these methods often lack flexibility, limiting their applicability and making it challenging for users to synthesize a source character into a desired target scene. To address this issue, we propose a novel framework, AnyCharV, that flexibly generates character videos using arbitrary source characters and target scenes, guided by pose information. Our approach involves a two-stage training process. In the first stage, we develop a base model capable of integrating the source character with the target scene using pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism, where we use the video generated in the first stage and replace the fine mask with a coarse one, enabling training that better preserves character details. Extensive experimental results demonstrate the superiority of our method compared with previous state-of-the-art methods.
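The abstract describes the second-stage self-boosting step only at a high level: the fine character mask used in stage one is swapped for a coarse one so that generation is less tightly constrained by exact silhouettes. Below is a minimal, illustrative sketch of one way such a coarse mask could be derived from a fine segmentation mask; the padded-bounding-box strategy, the function name, and the margin value are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative sketch only: derive a coarse character region from a fine binary mask.
# The padded bounding box below is an assumed coarsening strategy, not AnyCharV's.
import numpy as np

def coarsen_mask(fine_mask: np.ndarray, margin: int = 4) -> np.ndarray:
    """Replace a binary fine mask of shape (H, W) with a padded bounding-box region."""
    ys, xs = np.nonzero(fine_mask)
    coarse = np.zeros_like(fine_mask)
    if ys.size == 0:                      # empty mask: nothing to coarsen
        return coarse
    h, w = fine_mask.shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    coarse[y0:y1, x0:x1] = 1              # loose region covering the whole character
    return coarse

if __name__ == "__main__":
    fine = np.zeros((12, 12), dtype=np.uint8)
    fine[3:7, 4:8] = 1                    # toy "character" silhouette
    print(coarsen_mask(fine, margin=2))
```

One plausible reading of the two-stage scheme is that stage one conditions on the fine mask directly, while the stage-two self-boosting pass conditions on a coarsened region like the one above, computed over the stage-one outputs; the paper itself should be consulted for the exact procedure.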
Related papers
- Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router [72.29811385678168]
We introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose a novel framework incorporating a fine-grained Embedding Router that binds 'who' and 'speak what' together to address the audio-to-character correspondence control.
arXiv Detail & Related papers (2025-06-24T17:50:16Z) - Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics in a zero-shot manner without additional tuning. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z) - Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance [29.768141136041454]
We propose a novel multi-character video generation framework based on separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character, and then obtain individual prompts for each character with LLMs. The visualized results of the generated videos demonstrate the precise controllability of our method for multi-character generation.
arXiv Detail & Related papers (2024-12-21T05:49:40Z) - Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - StoryGPT-V: Large Language Models as Consistent Story Visualizers [33.68157535461168]
Generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts.
Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references.
We introduce StoryGPT-V, which leverages the merits of the latent diffusion model (LDM) and LLM to produce images with consistent and high-quality characters.
arXiv Detail & Related papers (2023-12-04T18:14:29Z) - Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos [107.65147103102662]
In this work, we utilize datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos.
Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation.
In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks.
arXiv Detail & Related papers (2023-04-03T17:55:14Z) - REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z) - Self-Supervised Equivariant Scene Synthesis from Video [84.15595573718925]
We propose a framework to learn scene representations from video that are automatically delineated into background, characters, and animations.
After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components.
We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
arXiv Detail & Related papers (2021-02-01T14:17:31Z) - Playable Video Generation [47.531594626822155]
We aim at allowing a user to control the generated video by selecting a discrete action at every time step as when playing a video game.
The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input.
We propose a novel framework for playable video generation (PVG) that is trained in a self-supervised manner on a large dataset of unlabelled videos; a toy sketch of this action-conditioned interaction loop appears after this entry.
arXiv Detail & Related papers (2021-01-28T18:55:58Z)
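As referenced in the Playable Video Generation entry above, the core interaction pattern is an autoregressive rollout in which a user-chosen discrete action conditions each next-frame prediction. The toy sketch below illustrates only that loop structure; the linear per-action "dynamics", the function name, and all constants are placeholders, not the paper's model.

```python
# Toy sketch of an action-conditioned autoregressive rollout (placeholder dynamics,
# not the Playable Video Generation model itself).
import numpy as np

NUM_ACTIONS, FRAME_SHAPE = 4, (16, 16)
rng = np.random.default_rng(0)

# Placeholder per-action "dynamics": each discrete action perturbs the frame differently.
action_effects = rng.normal(scale=0.1, size=(NUM_ACTIONS,) + FRAME_SHAPE)

def predict_next_frame(frame: np.ndarray, action: int) -> np.ndarray:
    """Stand-in for a learned generator conditioned on the previous frame and an action."""
    return np.clip(frame + action_effects[action], 0.0, 1.0)

frame = rng.random(FRAME_SHAPE)            # initial frame
user_actions = [0, 2, 1, 3, 2]             # e.g. chosen interactively, one per time step
video = [frame]
for a in user_actions:                     # autoregressive, action-conditioned rollout
    frame = predict_next_frame(frame, a)
    video.append(frame)
print(len(video), video[-1].shape)         # -> 6 (16, 16)
```

In an actual system, a learned generator would replace predict_next_frame, taking past frames and a learned action embedding as conditioning rather than a fixed additive perturbation.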
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.