Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models
- URL: http://arxiv.org/abs/2507.20094v1
- Date: Sun, 27 Jul 2025 01:32:13 GMT
- Title: Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models
- Authors: Ankit Sanjyal
- Abstract summary: We propose a simple, training-free architectural method called Local Prompt Adaptation (LPA). Our method decomposes the prompt into content and style tokens and injects them selectively into the U-Net's attention layers at different stages. We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have become a powerful backbone for text-to-image generation, enabling users to synthesize high-quality visuals from natural language prompts. However, they often struggle with complex prompts involving multiple objects and global or local style specifications. In such cases, the generated scenes tend to lack style uniformity and spatial coherence, limiting their utility in creative and controllable content generation. In this paper, we propose a simple, training-free architectural method called Local Prompt Adaptation (LPA). Our method decomposes the prompt into content and style tokens, and injects them selectively into the U-Net's attention layers at different stages. By conditioning object tokens early and style tokens later in the generation process, LPA enhances both layout control and stylistic consistency. We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL. Our approach outperforms prior work on both CLIP score and style consistency metrics, offering a new direction for controllable, expressive diffusion-based generation.
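To make the injection schedule described in the abstract concrete, below is a minimal, hypothetical Python sketch of the core idea: split the prompt into content and style tokens, condition on content tokens during early denoising steps (layout), and add style tokens later (style). The `STYLE_WORDS` lexicon, the `switch_frac` hyperparameter, and the purely timestep-based switch are illustrative assumptions, not the authors' implementation; the actual LPA method injects tokens into specific U-Net attention layers at different stages, which this toy omits.

```python
# Hypothetical sketch of LPA-style staged prompt injection (not the authors' code).
from typing import List, Tuple

# Toy lexicon of style-related words; the paper presumably uses a more
# principled content/style decomposition of the prompt.
STYLE_WORDS = {"watercolor", "cyberpunk", "impressionist", "noir", "pastel", "style"}


def split_prompt(prompt: str) -> Tuple[List[str], List[str]]:
    """Split a prompt into content tokens and style tokens (simple heuristic)."""
    content, style = [], []
    for tok in prompt.lower().replace(",", " ").split():
        (style if tok in STYLE_WORDS else content).append(tok)
    return content, style


def active_tokens(step: int, total_steps: int, content: List[str],
                  style: List[str], switch_frac: float = 0.5) -> List[str]:
    """Return the tokens used for conditioning at a given denoising step.

    Early steps see only content tokens (object layout); later steps also see
    style tokens (stylistic consistency). `switch_frac` is an assumed
    hyperparameter, not a value taken from the paper.
    """
    if step < switch_frac * total_steps:
        return content
    return content + style


if __name__ == "__main__":
    content, style = split_prompt("a dragon and a castle in watercolor style")
    for step in (0, 24, 25, 49):
        print(step, active_tokens(step, 50, content, style))
```

In a real pipeline, the two token sets would be encoded with the text encoder and routed to selected U-Net attention layers or timestep ranges rather than simply concatenated as shown here.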
Related papers
- StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling [5.12285618196312]
StyDeco is an unsupervised framework that learns text representations specifically tailored for the style transfer task. Our framework outperforms several existing approaches in both stylistic fidelity and structural preservation.
arXiv Detail & Related papers (2025-08-02T06:17:23Z) - ICAS: IP Adapter and ControlNet-based Attention Structure for Multi-Subject Style Transfer Optimization [0.0]
ICAS is a novel framework for efficient and controllable multi-subject style transfer. Our framework ensures faithful global layout preservation alongside accurate local style synthesis. ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency.
arXiv Detail & Related papers (2025-04-17T10:48:11Z) - ObjMST: An Object-Focused Multimodal Style Transfer Framework [2.732041684677653]
We propose an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surroundings. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image
arXiv Detail & Related papers (2025-03-06T11:55:44Z) - One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to generate consistent, identity-preserving images across a storytelling sequence. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z) - ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model [73.95608242322949]
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images.
We present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion to address challenges such as misinterpreted styles and inconsistent semantics.
arXiv Detail & Related papers (2024-05-24T07:19:40Z) - CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage textual features to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z) - LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z) - ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning.
We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles.
We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z) - GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents [3.229105662984031]
GestureDiffuCLIP is a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control.
Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator.
Our system can be extended to allow fine-grained style control of individual body parts.
arXiv Detail & Related papers (2023-03-26T03:35:46Z) - Prototype-to-Style: Dialogue Generation with Style-Aware Editing on Retrieval Memory [65.98002918470543]
We introduce a new prototype-to-style framework to tackle the challenge of stylistic dialogue generation.
The framework uses an Information Retrieval (IR) system and extracts a response prototype from the retrieved response.
A stylistic response generator then takes the prototype and the desired language style as model input to obtain a high-quality and stylistic response.
arXiv Detail & Related papers (2020-04-05T14:36:15Z)