Subject-Consistent and Pose-Diverse Text-to-Image Generation
- URL: http://arxiv.org/abs/2507.08396v1
- Date: Fri, 11 Jul 2025 08:15:56 GMT
- Title: Subject-Consistent and Pose-Diverse Text-to-Image Generation
- Authors: Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai,
- Abstract summary: We propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi. It enables consistent subject generation with diverse poses and layouts. CoDi achieves both better visual perception and stronger performance across all metrics.
- Score: 36.67159307721023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
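The abstract states the Identity Transport mechanism only at a high level. As a rough illustration of the idea, the sketch below matches reference identity tokens to a target image's tokens with an entropic (Sinkhorn) optimal-transport plan during the early denoising steps; all names here (sinkhorn, transport_identity, the blend weight alpha) are illustrative assumptions, not the released CoDi implementation.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic optimal transport with uniform marginals; returns a transport plan."""
    cost = cost / (cost.max() + 1e-8)                # normalise so the kernel does not underflow
    K = torch.exp(-cost / eps)                       # (n_ref, n_tgt)
    u = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    v = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                         # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return torch.diag(a) @ K @ torch.diag(b)         # transport plan, (n_ref, n_tgt)

def transport_identity(ref_feats, tgt_feats, alpha=0.5):
    """Pull one target image's tokens toward the reference identity in a pose-aware way.

    ref_feats: (n_ref, d) identity features from the reference image
    tgt_feats: (n_tgt, d) features of a target image at an early denoising step
    """
    cost = torch.cdist(ref_feats, tgt_feats)         # match by feature similarity, not by position
    plan = sinkhorn(cost)
    # Barycentric projection: what each target token receives from the reference
    received = plan.t() @ ref_feats / plan.sum(dim=0).unsqueeze(1).clamp_min(1e-8)
    return (1.0 - alpha) * tgt_feats + alpha * received
```

Under this reading, matching by feature distance rather than by spatial position is what makes the transfer pose-aware: the target keeps its own layout while its identity content is pulled toward the reference.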
Related papers
- Noise Consistency Regularization for Improved Subject-Driven Image Synthesis [55.75426086791612]
Fine-tuning Stable Diffusion enables subject-driven image synthesis by adapting the model to generate images containing specific subjects. Existing fine-tuning methods suffer from two key issues: underfitting, where the model fails to reliably capture subject identity, and overfitting, where it memorizes the subject image and reduces background diversity. We propose two auxiliary consistency losses for diffusion fine-tuning. First, a prior consistency regularization loss ensures that the predicted diffusion noise for prior (non-subject) images remains consistent with that of the pretrained model, improving fidelity.
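A minimal sketch of what such a prior consistency regularization term could look like in a diffusers-style training loop is given below; unet_finetuned, unet_frozen, and the scheduler handling are assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def prior_consistency_loss(unet_finetuned, unet_frozen, prior_latents, prior_text_emb, scheduler):
    """Keep the fine-tuned model's noise prediction on prior (non-subject) images close
    to the frozen pretrained model's prediction on the same noisy inputs."""
    noise = torch.randn_like(prior_latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (prior_latents.size(0),), device=prior_latents.device)
    noisy = scheduler.add_noise(prior_latents, noise, t)

    eps_tuned = unet_finetuned(noisy, t, encoder_hidden_states=prior_text_emb).sample
    with torch.no_grad():                            # the pretrained model stays fixed
        eps_prior = unet_frozen(noisy, t, encoder_hidden_states=prior_text_emb).sample
    return F.mse_loss(eps_tuned, eps_prior)
```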
arXiv Detail & Related papers (2025-06-06T19:17:37Z)
- DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis [3.6294581578004332]
We propose DynASyn, an effective multi-subject personalization method that works from a single reference image. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity.
arXiv Detail & Related papers (2025-03-22T10:56:35Z)
- IC-Portrait: In-Context Matching for View-Consistent Personalized Portrait [51.18967854258571]
IC-Portrait is a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners for in-context dense correspondence matching. We show that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively.
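The summary does not spell out how the correspondence matching works; the generic sketch below, matching two diffusion feature maps by nearest neighbours under cosine similarity, is an assumption about the general idea rather than the IC-Portrait implementation.

```python
import torch
import torch.nn.functional as F

def dense_correspondence(feats_a, feats_b):
    """Nearest-neighbour matching between two feature maps (e.g. mid-block UNet features).

    feats_a, feats_b: (c, h, w) feature maps from the same diffusion backbone.
    Returns, for every location in A, the index of its best match in B.
    """
    c, h, w = feats_a.shape
    a = F.normalize(feats_a.reshape(c, -1).t(), dim=1)   # (h*w, c), unit-norm rows
    b = F.normalize(feats_b.reshape(c, -1).t(), dim=1)
    sim = a @ b.t()                                      # cosine similarity, (h*w, h*w)
    return sim.argmax(dim=1)                             # best match in B for each location of A
```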
arXiv Detail & Related papers (2025-01-28T18:59:03Z)
- EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance [20.430259028981094]
Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and a subject image. Existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other. We introduce a new approach, EZIGen, built on two main components, the first of which leverages a fixed pre-trained Diffusion UNet itself as the subject encoder.
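One hedged reading of "a fixed pre-trained Diffusion UNet as subject encoder" is sketched below: run the subject latents through the frozen UNet once and harvest its self-attention outputs as subject features. The diffusers-style module naming ("attn1") and the timestep choice are assumptions, not EZIGen's actual code.

```python
import torch

def encode_subject(unet_frozen, subject_latents, text_emb, t=250):
    """Use a frozen diffusion UNet as a subject encoder: one forward pass over the
    subject latents while hooking the self-attention outputs of every block."""
    captured = []

    def hook(_module, _inputs, output):
        captured.append(output.detach())

    handles = [m.register_forward_hook(hook)
               for name, m in unet_frozen.named_modules()
               if name.endswith("attn1")]            # self-attention layers (diffusers naming)
    with torch.no_grad():
        t_batch = torch.full((subject_latents.size(0),), t,
                             device=subject_latents.device, dtype=torch.long)
        unet_frozen(subject_latents, t_batch, encoder_hidden_states=text_emb)
    for h in handles:
        h.remove()
    return captured                                  # per-layer subject features to inject later
```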
arXiv Detail & Related papers (2024-09-12T14:44:45Z)
- Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation [40.969861849933444]
We propose a novel P-T2I method called Layout-and-Retouch, consisting of two stages: 1) layout generation and 2) retouch.
In the first stage, our step-blended inference utilizes the inherent sample diversity of vanilla T2I models to produce diversified layout images.
In the second stage, multi-source attention swaps the context image from the first stage with the reference image, leveraging the structure from the context image and extracting visual features from the reference image.
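As an illustration of the multi-source attention idea (not the paper's exact operator), the sketch below takes queries from the context (layout) branch while letting the keys/values come from both the context and the reference image, so structure follows the context and appearance can flow in from the reference.

```python
import torch

def multi_source_attention(q_ctx, k_ctx, v_ctx, k_ref, v_ref):
    """Attention whose queries come from the context (layout) image and whose keys/values
    are the concatenation of context and reference (subject appearance) tokens.

    q_ctx, k_ctx, v_ctx: (b, n_ctx, d); k_ref, v_ref: (b, n_ref, d)
    """
    k = torch.cat([k_ctx, k_ref], dim=1)
    v = torch.cat([v_ctx, v_ref], dim=1)
    scores = q_ctx @ k.transpose(1, 2) / (q_ctx.size(-1) ** 0.5)
    return scores.softmax(dim=-1) @ v                # (b, n_ctx, d)
```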
arXiv Detail & Related papers (2024-07-13T05:28:45Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [5.452759083801634]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving text control.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistently portraying the same subject across diverse prompts remains challenging for text-to-image models.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
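A minimal sketch of the activation-sharing idea, assuming per-image subject masks are available, is shown below: each image's self-attention keys/values are extended with the subject tokens of the other images in the batch so the subject stays consistent across the set. This illustrates the general mechanism, not the ConsiStory implementation.

```python
import torch

def shared_self_attention_kv(k, v, subject_masks):
    """Extend each image's self-attention keys/values with the subject tokens of every
    other image in the batch.

    k, v: (b, n, d) per-image keys and values; subject_masks: (b, n) booleans marking subject tokens.
    Returns ragged per-image lists, since the number of shared tokens can differ per image.
    """
    b = k.size(0)
    k_out, v_out = [], []
    for i in range(b):
        others = [j for j in range(b) if j != i]
        if not others:                               # batch of one: nothing to share
            k_out.append(k[i])
            v_out.append(v[i])
            continue
        k_extra = torch.cat([k[j][subject_masks[j]] for j in others], dim=0)
        v_extra = torch.cat([v[j][subject_masks[j]] for j in others], dim=0)
        k_out.append(torch.cat([k[i], k_extra], dim=0))
        v_out.append(torch.cat([v[i], v_extra], dim=0))
    return k_out, v_out
```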
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout assigns and separates the locations of different subjects in the image.
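The sketch below illustrates one simple way such cross-attention rectification could be realised, boosting each subject token inside its layout region and damping it elsewhere; the boost/suppress factors and the mask format are assumptions, not the Cones 2 code.

```python
import torch

def rectify_cross_attention(attn, token_regions, boost=1.5, suppress=0.5):
    """Edit a cross-attention map so each subject token fires inside its layout region
    and is damped everywhere else.

    attn: (heads, hw, n_tokens) cross-attention probabilities
    token_regions: dict mapping a subject's token index to a (hw,) boolean region mask
    """
    attn = attn.clone()
    for tok, region in token_regions.items():
        attn[:, region, tok] = attn[:, region, tok] * boost        # strengthen inside the region
        attn[:, ~region, tok] = attn[:, ~region, tok] * suppress   # weaken outside it
    # renormalise over tokens so each spatial location remains a distribution
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```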
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
- DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation [71.87682778102236]
We propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) that re-composes each stage's text features within a single adversarial multi-stage architecture.
DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks.
arXiv Detail & Related papers (2022-09-03T06:13:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.