Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
- URL: http://arxiv.org/abs/2407.06642v2
- Date: Thu, 18 Jul 2024 15:34:04 GMT
- Title: Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
- Authors: Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, Wen Li
- Abstract summary: We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
- Score: 40.06403155373455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation models, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based approaches typically adopt a simple reconstruction objective during training, which struggles to enforce appropriate structural consistency between the generated and the reference images. To this end, in this paper, we design a novel reinforcement learning framework by utilizing the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differentiable or even non-differentiable, can be easily incorporated to supervise the diffusion models to improve the quality of the generated images. Experimental results on personalized text-to-image generation benchmark datasets demonstrate that our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment. Our code is available at: https://github.com/wfanyue/DPG-T2I-Personalization.
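As a rough illustration of the core idea (not the authors' actual training loop; `denoiser`, `reward_model`, `reference`, and all shapes are assumptions), a deterministic policy-gradient update with a differentiable reward reduces to ascending the reward through the denoiser:

```python
# Hedged sketch of a deterministic policy-gradient update for a diffusion
# denoiser. All names (denoiser, reward_model, reference) are illustrative
# assumptions, not the paper's actual interfaces.
import torch

def dpg_update(denoiser, reward_model, optimizer, x_t, t, prompt_emb, reference):
    """One update step: the denoiser acts as a deterministic policy whose
    'action' is the predicted clean image; the reward (e.g., structural
    similarity to the reference images) supervises it directly."""
    x0_pred = denoiser(x_t, t, prompt_emb)      # deterministic action
    reward = reward_model(x0_pred, reference)   # scalar reward per sample
    loss = -reward.mean()                       # gradient ascent on the reward
    optimizer.zero_grad()
    loss.backward()                             # chain rule through the policy
    optimizer.step()
    return -loss.item()
```

For a non-differentiable objective, the `reward_model` above would be replaced by a learned critic trained to approximate it, which is what makes the policy-gradient view more flexible than a plain reconstruction loss.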
Related papers
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST that focuses on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model across images, as sketched below.
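A minimal sketch of what activation sharing can look like in self-attention, assuming batched queries/keys/values of shape (batch, tokens, dim); this is an illustrative reading of the idea, not ConsiStory's actual mechanism, which additionally restricts sharing to subject regions:

```python
# Illustrative sketch: every image in the batch attends to the keys/values
# of all images, so subject appearance propagates without any training.
# Shapes and scaling are assumptions made for this example.
import torch

def shared_self_attention(q, k, v):
    # q, k, v: (batch, tokens, dim)
    b, n, d = k.shape
    k_all = k.reshape(1, b * n, d).expand(b, b * n, d)  # pool keys across batch
    v_all = v.reshape(1, b * n, d).expand(b, b * n, d)  # pool values across batch
    attn = torch.softmax(q @ k_all.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_all                                 # (batch, tokens, dim)
```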
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models [38.14123683674355]
We propose a method to utilize the attention mechanism in the denoising network of text-to-image diffusion models.
We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting.
Our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation; a toy sketch of the idea follows.
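As a toy illustration (not the paper's actual pipeline; the shape convention and threshold are assumptions), cross-attention weights collected from the denoising U-Net can be averaged over heads and layers for one text token and thresholded into a coarse mask:

```python
# Hedged sketch: turn cross-attention maps from a diffusion U-Net into a
# coarse entity mask. attn_maps is assumed to be a list of tensors shaped
# (heads, pixels, tokens), collected from the denoiser's attention layers.
import torch

def entity_mask(attn_maps, token_idx, size=(64, 64), threshold=0.5):
    per_layer = [a[:, :, token_idx].mean(dim=0) for a in attn_maps]  # avg heads
    avg = torch.stack(per_layer).mean(dim=0).reshape(size)           # avg layers
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)         # to [0, 1]
    return (avg > threshold).float()                                 # binary mask
```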
arXiv Detail & Related papers (2023-09-08T04:10:01Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)