CharaConsist: Fine-Grained Consistent Character Generation
- URL: http://arxiv.org/abs/2507.11533v1
- Date: Tue, 15 Jul 2025 17:58:08 GMT
- Title: CharaConsist: Fine-Grained Consistent Character Generation
- Authors: Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, Yunchao Wei
- Abstract summary: CharaConsist is the first consistent generation method tailored for the text-to-image DiT model. It enables fine-grained consistency for both the foreground and background. This fine-grained consistency, combined with the larger capacity of the latest base model, enables it to produce high-quality visual outputs.
- Score: 93.08900337098302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for the text-to-image DiT model. Its ability to maintain fine-grained consistency, combined with the larger capacity of the latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at https://github.com/Murray-Wang/CharaConsist
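The abstract names three mechanisms (point-tracking attention, adaptive token merge, and decoupled control of the foreground and background) without giving implementation details. As a rough illustration only, the minimal PyTorch sketch below shows how decoupled attention sharing between a cached reference frame and a frame being generated might look inside a DiT attention block. The function name, tensor shapes, masks, and the merge rule are assumptions made for this sketch, not the released CharaConsist code (see the linked repository for that), and point-tracking attention is omitted.

```python
# Minimal, hypothetical sketch of decoupled foreground/background attention
# sharing between a cached reference frame and a frame being generated.
# All names, shapes, and the merge rule are illustrative assumptions, not
# the released CharaConsist implementation.
import torch
import torch.nn.functional as F

def decoupled_shared_attention(q_new, k_new, v_new, k_ref, v_ref,
                               fg_mask_new, fg_mask_ref, scale):
    """q_new/k_new/v_new: (N, D) tokens of the frame being generated.
    k_ref/v_ref: (N, D) cached tokens from the reference frame.
    fg_mask_new/fg_mask_ref: (N,) bool masks marking character (foreground) tokens."""
    out = torch.empty_like(q_new)

    # Foreground queries also attend to the reference character's tokens,
    # which is what carries identity and clothing details across shots.
    k_fg = torch.cat([k_new, k_ref[fg_mask_ref]], dim=0)
    v_fg = torch.cat([v_new, v_ref[fg_mask_ref]], dim=0)
    out[fg_mask_new] = F.softmax(q_new[fg_mask_new] @ k_fg.T * scale, dim=-1) @ v_fg

    # Background queries additionally see the cached reference background tokens
    # (a crude stand-in for token merging), so a fixed scene is reproduced
    # rather than re-imagined from the prompt alone.
    k_bg = torch.cat([k_new, k_ref[~fg_mask_ref]], dim=0)
    v_bg = torch.cat([v_new, v_ref[~fg_mask_ref]], dim=0)
    out[~fg_mask_new] = F.softmax(q_new[~fg_mask_new] @ k_bg.T * scale, dim=-1) @ v_bg
    return out
```

In a full pipeline, a step of this kind would run inside each attention layer, after foreground masks for the reference and target frames have been estimated.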
Related papers
- Subject-Consistent and Pose-Diverse Text-to-Image Generation [36.67159307721023]
We propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi. It enables consistent subject generation with diverse poses and layouts. CoDi achieves both better visual perception and stronger performance across all metrics.
arXiv Detail & Related papers (2025-07-11T08:15:56Z)
- Storybooth: Training-free Multi-Subject Consistency for Improved Visual Storytelling [5.713041172936274]
Cross-frame self-attention improves subject consistency by allowing the tokens in each frame to attend to the tokens in other frames during the self-attention computation (a generic sketch of this mechanism is given after this list). Our exploration reveals that self-attention leakage is exacerbated when trying to ensure consistency across multiple characters. Motivated by these findings, we propose StoryBooth: a training-free approach for improving multi-character consistency.
arXiv Detail & Related papers (2025-04-08T08:30:55Z)
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in the video DiT architecture. We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space. We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context.
arXiv Detail & Related papers (2025-03-27T08:07:45Z)
- StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation [10.652011707000202]
We introduce StoryMaker, a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and body consistency.
StoryMaker supports numerous applications and is compatible with other societal plug-ins.
arXiv Detail & Related papers (2024-09-19T08:53:06Z)
- Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling [77.08568533331206]
We propose a novel multi-condition guided framework for character image animation. We employ several well-designed input modules to enhance the implicit decoupling capability of the model. Our method excels in generating high-quality character animations, especially in scenarios with complex backgrounds and multiple characters.
arXiv Detail & Related papers (2024-06-05T08:03:18Z)
- Zero-shot High-fidelity and Pose-controllable Character Animation [89.74818983864832]
Image-to-video (I2V) generation aims to create a video sequence from a single image.
Existing approaches suffer from inconsistency of character appearances and poor preservation of fine details.
We propose PoseAnimate, a novel zero-shot I2V framework for character animation.
arXiv Detail & Related papers (2024-04-21T14:43:31Z)
- Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences.
Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately.
We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
arXiv Detail & Related papers (2024-03-13T13:10:20Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
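The Storybooth entry above describes cross-frame self-attention, where the tokens of each frame also attend to the tokens of other frames. The snippet below is a generic, hedged illustration of that idea under assumed tensor shapes (batch, frames, tokens per frame, head dimension); it is not taken from any of the listed papers' code.

```python
# Generic, hypothetical sketch of cross-frame self-attention: every frame's
# queries attend to the keys/values of all frames, which couples subject
# appearance across the sequence. Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def cross_frame_self_attention(q, k, v):
    """q, k, v: (B, F, N, D) -- batch, frames, tokens per frame, head dim."""
    B, Fr, N, D = q.shape
    # Flatten the frame axis into one long key/value sequence;
    # the singleton frame dimension broadcasts against q's frame dimension.
    k_all = k.reshape(B, 1, Fr * N, D)
    v_all = v.reshape(B, 1, Fr * N, D)
    attn = F.softmax(q @ k_all.transpose(-1, -2) / D ** 0.5, dim=-1)
    return attn @ v_all  # (B, F, N, D)
```

The self-attention leakage mentioned in that summary presumably arises because nothing in this formulation prevents one character's tokens from attending to another character's tokens in other frames.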
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.