IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation
- URL: http://arxiv.org/abs/2505.10743v1
- Date: Thu, 15 May 2025 23:08:52 GMT
- Title: IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation
- Authors: Amritanshu Tiwari, Cherish Puniani, Kaustubh Sharma, Ojasva Nema
- Abstract summary: We propose a two-stage pipeline that addresses the challenges of personalizing text-to-image models. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging, often leading to catastrophic forgetting, overfitting, or large computational overhead. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning of the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL's broader generative capabilities while integrating the new subject with high fidelity. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
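To make the two-stage design concrete, below is a minimal sketch using Hugging Face diffusers. It is an illustration under stated assumptions, not the authors' released code: the segmentation step is stubbed out (the abstract does not name a segmenter), the LoRA checkpoint path, prompts, and subject token are placeholders, and masked SDXL inpainting stands in for the paper's segmentation-driven Img2Img pass.

```python
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLInpaintPipeline

BASE = "stabilityai/stable-diffusion-xl-base-1.0"
device = "cuda"

# Stage 1: the unmodified SDXL base model renders the overall scene, with the
# personalized subject replaced by its generic class label ("a dog").
base = StableDiffusionXLPipeline.from_pretrained(BASE, torch_dtype=torch.float16).to(device)
scene = base("a dog sitting on a park bench at golden hour",
             num_inference_steps=30).images[0]

def segment_subject(image: Image.Image) -> Image.Image:
    """Placeholder segmenter (assumption: the paper's segmenter is not given
    in this summary). A real pipeline would localize the class-label subject
    with an off-the-shelf model such as SAM; here a fixed central box is
    masked purely so the sketch runs end to end."""
    mask = Image.new("L", image.size, 0)
    w, h = image.size
    ImageDraw.Draw(mask).rectangle([w // 4, h // 4, 3 * w // 4, 3 * h // 4], fill=255)
    return mask

mask = segment_subject(scene)

# Stage 2: re-denoise only the masked subject region with the subject LoRA
# attached, leaving the stage-1 composition untouched.
inpaint = StableDiffusionXLInpaintPipeline.from_pretrained(
    BASE, torch_dtype=torch.float16).to(device)
inpaint.load_lora_weights("path/to/subject_lora")  # hypothetical LoRA checkpoint
result = inpaint(
    prompt="a sks dog sitting on a park bench at golden hour",  # 'sks' = subject token
    image=scene,
    mask_image=mask,
    strength=0.8,
    num_inference_steps=30,
).images[0]
result.save("personalized.png")
```

Because the LoRA weights only influence the masked region, the subject identity comes from the fine-tuned weights while the composition from stage 1 is preserved, which is the isolation the abstract describes.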
Related papers
- DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition [69.10628479553709]
We introduce DRC, a novel personalized image generation framework that enhances Large Multimodal Models (LMMs). DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively. It involves two critical learning stages: 1) disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation.
arXiv Detail & Related papers (2025-04-24T08:10:10Z)
- Towards Effective User Attribution for Latent Diffusion Models via Watermark-Informed Blending [54.26862913139299]
We introduce a novel framework, Towards Effective user Attribution for latent diffusion models via Watermark-Informed Blending (TEAWIB). TEAWIB incorporates a unique ready-to-use configuration approach that allows seamless integration of user-specific watermarks into generative models. Experiments validate the effectiveness of TEAWIB, showcasing state-of-the-art performance in perceptual quality and attribution accuracy.
arXiv Detail & Related papers (2024-09-17T07:52:09Z)
- EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance [20.430259028981094]
Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image. Existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other. We introduce a new approach, EZIGen, built on two main components: a fixed pre-trained Diffusion UNet used as the subject encoder, and a decoupled guidance scheme.
arXiv Detail & Related papers (2024-09-12T14:44:45Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss [6.171638819257848]
Stable Diffusion XL (SDXL) has become the best open-source text-to-image (T2I) model thanks to its versatility and top-notch image quality.
Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability.
We introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively.
arXiv Detail & Related papers (2024-01-05T07:21:46Z)
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images [56.17404812357676]
Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters composition problems when generating images of varying sizes.
We propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size.
We show that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
arXiv Detail & Related papers (2023-08-31T09:27:56Z)
- Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z)
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [8.648456572970035]
We present SDXL, a latent diffusion model for text-to-image synthesis.
Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone.
We show drastically improved performance compared to previous versions of Stable Diffusion.
arXiv Detail & Related papers (2023-07-04T23:04:57Z)
- Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA [64.10981296843609]
We show that recent state-of-the-art customization of text-to-image models suffers from catastrophic forgetting when new concepts arrive sequentially.
We propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in the cross-attention layers of the popular Stable Diffusion model; a generic low-rank adaptation layer of this kind is sketched after this list.
We show that C-LoRA not only outperforms several baselines in our proposed setting of text-to-image continual customization, but also achieves a new state of the art in the well-established rehearsal-free continual learning setting for image classification.
arXiv Detail & Related papers (2023-04-12T17:59:41Z)
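Both the main paper (LoRA fine-tuning of SDXL's U-Net attention weights) and C-LoRA above rest on the same low-rank adaptation mechanism. The following is a minimal, self-contained PyTorch sketch of that mechanism; the class name, rank, and scaling values are illustrative choices, not taken from either paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear projection W with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A maps d_in -> r and B maps r -> d_out."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)         # the pretrained weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)      # update starts at zero: no initial drift
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: adapt the query projection of one cross-attention block.
q_proj = nn.Linear(320, 320)                # stands in for a U-Net attention weight
lora_q = LoRALinear(q_proj, rank=8)
y = lora_q(torch.randn(2, 77, 320))         # only 'down' and 'up' receive gradients
```

Initializing the up-projection to zero means the adapted model starts exactly at the pretrained behavior, which is what lets a LoRA learn a new subject without immediately disturbing the base model's generative capabilities.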
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.