Related papers: Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models

URL: http://arxiv.org/abs/2507.20094v2
Date: Sun, 17 Aug 2025 15:58:51 GMT
Title: Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models
Authors: Ankit Sanjyal
Abstract summary: Local Prompt Adaptation (LPA) is a lightweight, training-free method that splits the prompt into content and style tokens and injects them selectively into the U-Net's attention layers. On the T2I benchmark, LPA improves CLIP-prompt alignment over vanilla SDXL by +0.41% and over SD1.5 by +0.34%, with no diversity loss. On our custom 50-prompt style-rich benchmark, LPA achieves +0.09% CLIP-prompt and +0.08% CLIP-style gains over baseline.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have become a powerful backbone for text-to-image generation, producing high-quality visuals from natural language prompts. However, when prompts involve multiple objects alongside global or local style instructions, the outputs often drift in style and lose spatial coherence, limiting their reliability for controlled, style-consistent scene generation. We present Local Prompt Adaptation (LPA), a lightweight, training-free method that splits the prompt into content and style tokens, then injects them selectively into the U-Net's attention layers at chosen timesteps. By conditioning object tokens early and style tokens later in the denoising process, LPA improves both layout control and stylistic uniformity without additional training cost. We conduct extensive ablations across parser settings and injection windows, finding that the best configuration -- lpa late only with a 300-650 step window -- delivers the strongest balance of prompt alignment and style consistency. On the T2I benchmark, LPA improves CLIP-prompt alignment over vanilla SDXL by +0.41% and over SD1.5 by +0.34%, with no diversity loss. On our custom 50-prompt style-rich benchmark, LPA achieves +0.09% CLIP-prompt and +0.08% CLIP-style gains over baseline. Our method is model-agnostic, easy to integrate, and requires only a single configuration change, making it a practical choice for controllable, style-consistent multi-object generation.
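The scheduling idea in the abstract (content tokens early, style tokens later) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword-based parser, the `STYLE_WORDS` vocabulary, the function names, and the `style_start` threshold are all hypothetical stand-ins, and the sketch only selects which token group would be emphasized at a given denoising step rather than touching any U-Net attention layer.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical style vocabulary; the paper's actual parser is not described here.
STYLE_WORDS = {"watercolor", "cyberpunk", "sketch", "style"}


@dataclass
class SplitPrompt:
    content: List[str]  # object/content words
    style: List[str]    # style words


def split_prompt(prompt: str, style_words=STYLE_WORDS) -> SplitPrompt:
    """Split a prompt into content and style words by keyword lookup (assumed rule)."""
    content, style = [], []
    for word in prompt.lower().replace(",", " ").split():
        (style if word in style_words else content).append(word)
    return SplitPrompt(content, style)


def tokens_for_step(split: SplitPrompt, step: int, num_steps: int,
                    style_start: float = 0.5) -> List[str]:
    """Return the token group to emphasize at a denoising step.

    `step` counts from 0 (first, noisiest step) to num_steps - 1.
    `style_start` is an assumed fraction of progress after which style takes over.
    """
    progress = step / max(num_steps - 1, 1)
    return split.content if progress < style_start else split.style


split = split_prompt("a cat and a dog, watercolor style")
for step in (0, 4, 5, 9):
    print(step, tokens_for_step(split, step, num_steps=10))
```

In a real pipeline the selected group would condition cross-attention at that step; how the reported 300-650 window maps onto such a schedule is left open here.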

Related papers

MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation [12.481603155570037]
We propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters.
arXiv Detail & Related papers (2026-02-24T22:00:34Z)
$β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment [53.42377319350806]
$β$-CLIP is a multi-granular text-conditioned contrastive learning framework. $β$-CAL addresses the semantic overlap inherent in this hierarchy. $β$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence.
arXiv Detail & Related papers (2025-12-14T13:03:20Z)
Instant Preference Alignment for Text-to-Image Diffusion Models [29.85008982524577]
We propose a training-free framework grounded in multimodal large language model (MLLM) priors. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image. For preference-guided generation, we integrate global keyword-based control and local region-aware cross-attention modulation.
arXiv Detail & Related papers (2025-08-25T06:51:15Z)
StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling [5.12285618196312]
StyDeco is an unsupervised framework that learns text representations specifically tailored for the style transfer task. Our framework outperforms several existing approaches in both stylistic fidelity and structural preservation.
arXiv Detail & Related papers (2025-08-02T06:17:23Z)
ICAS: IP Adapter and ControlNet-based Attention Structure for Multi-Subject Style Transfer Optimization [0.0]
ICAS is a novel framework for efficient and controllable multi-subject style transfer. Our framework ensures faithful global layout preservation alongside accurate local style synthesis. ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency.
arXiv Detail & Related papers (2025-04-17T10:48:11Z)
ObjMST: An Object-Focused Multimodal Style Transfer Framework [2.732041684677653]
We propose an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surroundings. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image
arXiv Detail & Related papers (2025-03-06T11:55:44Z)
RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars [57.6513924960128]
Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment.
arXiv Detail & Related papers (2025-02-17T11:16:19Z)
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. They struggle to support the consistent generation of identity-preserving requirements for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model [73.95608242322949]
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. We present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion to address challenges such as misinterpreted styles and inconsistent semantics.
arXiv Detail & Related papers (2024-05-24T07:19:40Z)
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains. We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z)
Repetition Improves Language Model Embeddings [86.71985212601258]
"Echo embeddings" convert autoregressive language models into strong text embedding models without changing the architecture or requiring fine-tuning. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training.
arXiv Detail & Related papers (2024-02-23T17:25:10Z)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit controllable text generation. We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts. We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z)
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents [3.229105662984031]
GestureDiffuCLIP is a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator. Our system can be extended to allow fine-grained style control of individual body parts.
arXiv Detail & Related papers (2023-03-26T03:35:46Z)
MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
Prototype-to-Style: Dialogue Generation with Style-Aware Editing on Retrieval Memory [65.98002918470543]
We introduce a new prototype-to-style framework to tackle the challenge of stylistic dialogue generation. The framework uses an Information Retrieval (IR) system and extracts a response prototype from the retrieved response. A stylistic response generator then takes the prototype and the desired language style as model input to obtain a high-quality and stylistic response.
arXiv Detail & Related papers (2020-04-05T14:36:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.