ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation
- URL: http://arxiv.org/abs/2509.18092v2
- Date: Thu, 16 Oct 2025 05:30:16 GMT
- Title: ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation
- Authors: Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman
- Abstract summary: We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image.
- Score: 39.34778197087224
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.
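Below is a minimal PyTorch-style sketch of the attribute-specific prompting idea described in the abstract: each attribute (identity, hair, clothing) has its own set of reference images, which are encoded into a small number of attribute-specific tokens that a frozen text-to-image diffusion model can attend to alongside the text tokens. The encoder design, token counts, injection point, and all module names here are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: attribute-specific image prompting via per-attribute tokens.
# Assumptions (not from the paper): a query-based tokenizer per attribute,
# CLIP-like patch features as input, and injection by concatenating attribute
# tokens with text tokens for the frozen model's cross-attention.
import torch
import torch.nn as nn


class AttributeTokenizer(nn.Module):
    """Encodes reference-image features for ONE attribute into K tokens."""

    def __init__(self, img_feat_dim=1024, token_dim=768, num_tokens=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.proj = nn.Linear(img_feat_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, ref_feats):
        # ref_feats: (B, num_refs * num_patches, img_feat_dim)
        kv = self.proj(ref_feats)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)  # (B, num_tokens, token_dim)
        return tokens


class MultiAttributePrompt(nn.Module):
    """Builds one conditioning sequence from text tokens plus per-attribute tokens."""

    def __init__(self, attributes=("identity", "hair", "clothing")):
        super().__init__()
        self.tokenizers = nn.ModuleDict({a: AttributeTokenizer() for a in attributes})

    def forward(self, text_tokens, ref_feats_by_attr):
        # text_tokens: (B, T, 768); ref_feats_by_attr: {attr_name: (B, L, 1024)}
        parts = [text_tokens]
        for attr, feats in ref_feats_by_attr.items():
            parts.append(self.tokenizers[attr](feats))
        # The frozen diffusion model's cross-attention would attend over this sequence.
        return torch.cat(parts, dim=1)


if __name__ == "__main__":
    B = 2
    cond = MultiAttributePrompt()(
        text_tokens=torch.randn(B, 77, 768),
        ref_feats_by_attr={
            "identity": torch.randn(B, 257, 1024),  # e.g. patch features of face crops
            "hair": torch.randn(B, 257, 1024),
            "clothing": torch.randn(B, 257, 1024),
        },
    )
    print(cond.shape)  # (2, 77 + 3 * 4, 768)
```

Because each attribute contributes its own tokens, swapping the hair references while keeping the identity references fixed changes only that slice of the conditioning sequence, which is the kind of compositional, disentangled control the abstract describes; the cross-reference training strategy mentioned there is not modeled in this sketch.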
Related papers
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization [82.31106470150844]
We introduce Omni-Attribute, the first open-vocabulary image attribute encoder to learn attribute-specific representations. We use a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation.
arXiv Detail & Related papers (2025-12-11T18:59:56Z) - Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z) - From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation [19.096741614175524]
Parts2Whole is a novel framework designed for generating customized portraits from multiple reference images.
We first develop a semantic-aware appearance encoder to retain details of different human parts.
Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism.
arXiv Detail & Related papers (2024-04-23T17:56:08Z) - Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z) - When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation [60.305112612629465]
Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images.
We present a novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models.
Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions.
arXiv Detail & Related papers (2023-11-29T09:05:14Z) - ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
arXiv Detail & Related papers (2023-05-25T16:32:01Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - Controllable Person Image Synthesis with Attribute-Decomposed GAN [27.313729413684012]
This paper introduces the Attribute-Decomposed GAN, a novel generative model for controllable person image synthesis.
The core idea of the proposed model is to embed human attributes into the latent space as independent codes.
Experimental results demonstrate the proposed method's superiority over the state of the art in pose transfer.
arXiv Detail & Related papers (2020-03-27T07:47:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.