IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation
- URL: http://arxiv.org/abs/2505.10743v1
- Date: Thu, 15 May 2025 23:08:52 GMT
- Title: IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation
- Authors: Amritanshu Tiwari, Cherish Puniani, Kaustubh Sharma, Ojasva Nema
- Abstract summary: We propose a two-stage pipeline that addresses the challenges of personalizing text-to-image models. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging, often leading to catastrophic forgetting, overfitting, or large computational overhead. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning of the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL's broader generative capabilities while integrating the new subject with high fidelity. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
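To make the two-stage design concrete, below is a minimal sketch using Hugging Face diffusers. It is an illustration under stated assumptions, not the authors' released code: the segmentation step is stubbed out (the abstract does not name a segmenter), the LoRA checkpoint path, prompts, and subject token are placeholders, and masked SDXL inpainting stands in for the paper's segmentation-driven Img2Img pass.

```python
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLInpaintPipeline

BASE = "stabilityai/stable-diffusion-xl-base-1.0"
device = "cuda"

# Stage 1: the unmodified SDXL base model renders the overall scene, with the
# personalized subject replaced by its generic class label ("a dog").
base = StableDiffusionXLPipeline.from_pretrained(BASE, torch_dtype=torch.float16).to(device)
scene = base("a dog sitting on a park bench at golden hour",
             num_inference_steps=30).images[0]

def segment_subject(image: Image.Image) -> Image.Image:
    """Placeholder segmenter (assumption: the paper's segmenter is not given
    in this summary). A real pipeline would localize the class-label subject
    with an off-the-shelf model such as SAM; here a fixed central box is
    masked purely so the sketch runs end to end."""
    mask = Image.new("L", image.size, 0)
    w, h = image.size
    ImageDraw.Draw(mask).rectangle([w // 4, h // 4, 3 * w // 4, 3 * h // 4], fill=255)
    return mask

mask = segment_subject(scene)

# Stage 2: re-denoise only the masked subject region with the subject LoRA
# attached, leaving the stage-1 composition untouched.
inpaint = StableDiffusionXLInpaintPipeline.from_pretrained(
    BASE, torch_dtype=torch.float16).to(device)
inpaint.load_lora_weights("path/to/subject_lora")  # hypothetical LoRA checkpoint
result = inpaint(
    prompt="a sks dog sitting on a park bench at golden hour",  # 'sks' = subject token
    image=scene,
    mask_image=mask,
    strength=0.8,
    num_inference_steps=30,
).images[0]
result.save("personalized.png")
```

Because the LoRA weights only influence the masked region, the subject identity comes from the fine-tuned weights while the composition from stage 1 is preserved, which is the isolation the abstract describes.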
Related papers
- DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition [69.10628479553709]
We introduce DRC, a novel personalized image generation framework that enhances Large Multimodal Models (LMMs). DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively. It involves two critical learning stages: 1) disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation.
arXiv Detail & Related papers (2025-04-24T08:10:10Z)
- Towards Effective User Attribution for Latent Diffusion Models via Watermark-Informed Blending [54.26862913139299]
We introduce a novel framework, Towards Effective user Attribution for latent diffusion models via Watermark-Informed Blending (TEAWIB). TEAWIB incorporates a unique ready-to-use configuration approach that allows seamless integration of user-specific watermarks into generative models. Experiments validate the effectiveness of TEAWIB, showcasing state-of-the-art performance in perceptual quality and attribution accuracy.
arXiv Detail & Related papers (2024-09-17T07:52:09Z)
- EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance [20.430259028981094]
Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image. Existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other. We introduce a new approach, EZIGen, built on two main components: a fixed pre-trained Diffusion UNet used as the subject encoder, and a decoupled guidance scheme.
arXiv Detail & Related papers (2024-09-12T14:44:45Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss [6.171638819257848]
Stable Diffusion XL (SDXL) has become the best open-source text-to-image (T2I) model thanks to its versatility and top-notch image quality.
Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability.
We introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively.
arXiv Detail & Related papers (2024-01-05T07:21:46Z)
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images [56.17404812357676]
Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters composition problems when generating images of varying sizes.
We propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size.
We show that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
arXiv Detail & Related papers (2023-08-31T09:27:56Z)
- Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z)
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [8.648456572970035]
We present SDXL, a latent diffusion model for text-to-image synthesis.
Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone.
We show drastically improved performance compared to previous versions of Stable Diffusion.
arXiv Detail & Related papers (2023-07-04T23:04:57Z)
- Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA [64.10981296843609]
We show that recent state-of-the-art customization of text-to-image models suffers from catastrophic forgetting when new concepts arrive sequentially.
We propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in the cross-attention layers of the popular Stable Diffusion model; a generic low-rank adaptation layer of this kind is sketched after this list.
We show that C-LoRA not only outperforms several baselines in our proposed setting of text-to-image continual customization, but also achieves a new state of the art in the well-established rehearsal-free continual learning setting for image classification.
arXiv Detail & Related papers (2023-04-12T17:59:41Z)
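Both the main paper (LoRA fine-tuning of SDXL's U-Net attention weights) and C-LoRA above rest on the same low-rank adaptation mechanism. The following is a minimal, self-contained PyTorch sketch of that mechanism; the class name, rank, and scaling values are illustrative choices, not taken from either paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear projection W with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A maps d_in -> r and B maps r -> d_out."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)         # the pretrained weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)      # update starts at zero: no initial drift
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: adapt the query projection of one cross-attention block.
q_proj = nn.Linear(320, 320)                # stands in for a U-Net attention weight
lora_q = LoRALinear(q_proj, rank=8)
y = lora_q(torch.randn(2, 77, 320))         # only 'down' and 'up' receive gradients
```

Initializing the up-projection to zero means the adapted model starts exactly at the pretrained behavior, which is what lets a LoRA learn a new subject without immediately disturbing the base model's generative capabilities.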
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.