Direct Consistency Optimization for Compositional Text-to-Image Personalization
- URL: http://arxiv.org/abs/2402.12004v1
- Date: Mon, 19 Feb 2024 09:52:41 GMT
- Title: Direct Consistency Optimization for Compositional Text-to-Image Personalization
- Authors: Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin
- Abstract summary: Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
- Score: 73.94505688626651
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) diffusion models, when fine-tuned on a few personal
images, are able to generate visuals with a high degree of consistency.
However, they still fall short in synthesizing images of the different scenarios or
styles that the original pretrained models can produce. To address this, we
propose to fine-tune the T2I model by maximizing consistency to reference
images, while penalizing the deviation from the pretrained model. We devise a
novel training objective for T2I diffusion models that minimally fine-tunes the
pretrained model to achieve consistency. Our method, dubbed \emph{Direct
Consistency Optimization}, is as simple as regular diffusion loss, while
significantly enhancing the compositionality of personalized T2I models. Also,
our approach induces a new sampling method that controls the tradeoff between
image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using
a comprehensive caption for reference images to further enhance the image-text
alignment. We show the efficacy of the proposed method on the T2I
personalization for subject, style, or both. In particular, our method results
in a superior Pareto frontier to the baselines. Generated examples and code
are available on our project page (https://dco-t2i.github.io/).
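The abstract describes the objective only at a high level. Below is a minimal PyTorch sketch, assuming a DPO-style reading of "maximize consistency while penalizing deviation": the fine-tuned model is rewarded only for denoising the reference images better than the frozen pretrained model does. The beta hyperparameter and the sampling mixer w are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dco_loss(eps_theta, eps_ref, eps_true, beta=1.0):
    """Consistency loss sketch: log-sigmoid of the gap between the
    fine-tuned and frozen models' per-sample denoising errors."""
    err_theta = ((eps_theta - eps_true) ** 2).flatten(1).mean(dim=1)
    err_ref = ((eps_ref - eps_true) ** 2).flatten(1).mean(dim=1)
    # Deviating from the pretrained model is penalized: the loss only
    # improves when the fine-tuned model beats the frozen reference.
    return -F.logsigmoid(-beta * (err_theta - err_ref)).mean()

def guided_eps(eps_theta, eps_ref, w=1.0):
    """Sampling-time tradeoff: interpolate between the pretrained (prompt
    fidelity) and fine-tuned (subject fidelity) noise predictions."""
    return eps_ref + w * (eps_theta - eps_ref)
```

At w = 0 sampling reduces to the pretrained model; larger w pulls generations toward the personalized subject, tracing the fidelity tradeoff the abstract mentions.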
Related papers
- PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction [38.424899483761656]
PaRa is an effective and efficient Rank Reduction approach for T2I model personalization.
Our design is motivated by the fact that taming a T2I model toward a novel concept implies a small generation space.
We show that PaRa achieves clear advantages over existing fine-tuning approaches on single/multi-subject generation as well as single-image editing.
arXiv Detail & Related papers (2024-06-09T04:51:51Z)
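PaRa's exact parameterization is not given in the summary above. As a rough illustration of rank-constrained personalization, here is a generic low-rank update around a frozen weight in the LoRA style, a related but distinct technique (PaRa instead reduces the rank of the generation space):

```python
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Rank-r update W + A @ B around a frozen pretrained linear layer
    (illustrative LoRA-style sketch, not PaRa's scheme)."""
    def __init__(self, linear: nn.Linear, r: int = 4):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weight frozen
        self.A = nn.Parameter(torch.zeros(linear.out_features, r))
        self.B = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)

    def forward(self, x):
        # Only the rank-r factors A and B receive gradients.
        return self.base(x) + x @ (self.A @ self.B).T
```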
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using the following two strategies.
DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
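The summary above says a distribution over soft prompts is learned, without giving the parameterization. A minimal sketch, assuming a reparameterized Gaussian over a sequence of prompt token embeddings (the Gaussian form and all names are assumptions):

```python
import torch
import torch.nn as nn

class PromptDistribution(nn.Module):
    """Learnable Gaussian over K soft prompt token embeddings
    (assumed mechanism, not the paper's exact design)."""
    def __init__(self, num_tokens: int = 8, dim: int = 768):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.log_sigma = nn.Parameter(torch.zeros(num_tokens, dim))

    def sample(self) -> torch.Tensor:
        # Reparameterization trick: sampling stays differentiable, so the
        # distribution can be trained through a frozen diffusion model.
        return self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)

# Mixing two learned concepts could be as simple as interpolating their
# means: mu_mix = 0.5 * (dist_a.mu + dist_b.mu)  (illustrative only).
```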
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in generating images that accurately reflect the prompt.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
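The two-stage process above is only named, not detailed; in this line of work the LLM typically first proposes a bounding-box layout for the prompt, which then grounds the diffusion model. A sketch under that assumption (the prompt template and JSON schema are illustrative):

```python
import json

# Stage 1 (assumed): ask an LLM for a scene layout in a fixed JSON schema.
LAYOUT_PROMPT = (
    'Propose a layout for the scene below as JSON, one entry per object: '
    '[{{"object": ..., "box": [x, y, w, h]}}]\nScene: {caption}'
)

def parse_layout(llm_reply: str):
    """Turn the LLM's JSON reply into (object, box) pairs for stage 2."""
    return [(item["object"], item["box"]) for item in json.loads(llm_reply)]

# Stage 2 (not shown): each (object, box) pair conditions generation,
# e.g. by steering that object's cross-attention into its box.

reply = '[{"object": "a corgi", "box": [10, 40, 50, 50]}]'
print(parse_layout(reply))  # [('a corgi', [10, 40, 50, 50])]
```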
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
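The selection pipeline above is described only abstractly. A minimal generate-then-select sketch; the CLIP image-text score is an assumed choice of scorer, since the summary says only "an automatic scoring system":

```python
import torch
import clip  # OpenAI CLIP package (assumed scorer; any image-text score works)
from PIL import Image

def best_of_n(prompt, images, device="cpu"):
    """Score N candidate images against the prompt and return the most
    faithful one."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    text = clip.tokenize([prompt]).to(device)
    batch = torch.stack([preprocess(im) for im in images]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(batch)
        txt_f = model.encode_text(text)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        scores = (img_f @ txt_f.T).squeeze(1)  # cosine similarity per image
    return images[scores.argmax().item()]
```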
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation models on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It is beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
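"Retrieval-then-optimization" is not detailed above; the sketch below assumes CLIP-space pseudo text features: retrieve the text features nearest an image's CLIP embedding, then refine their combination toward the image feature. The mechanism and all names are assumptions:

```python
import torch

def pseudo_text_feature(img_feat, text_bank, k=8, steps=100, lr=0.1):
    """Build a pseudo text feature for an unlabeled image from a bank of
    unit-norm CLIP text features (retrieval, then optimization)."""
    sims = text_bank @ img_feat            # cosine similarities
    topk = text_bank[sims.topk(k).indices]  # retrieval step
    w = torch.zeros(k, requires_grad=True)  # mixing weights to optimize
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        feat = (w.softmax(0)[:, None] * topk).sum(0)
        feat = feat / feat.norm()
        loss = -feat @ img_feat  # pull the mixture toward the image feature
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ((w.softmax(0)[:, None] * topk).sum(0)).detach()
```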
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
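GLIDE's preferred strategy, classifier-free guidance, extrapolates from the unconditional to the conditional noise prediction. A minimal sketch; the model's call signature here is a placeholder:

```python
def classifier_free_guidance(model, x_t, t, cond, uncond, w=3.0):
    """eps_hat = eps(x, uncond) + w * (eps(x, cond) - eps(x, uncond)).
    w > 1 trades sample diversity for caption alignment and photorealism."""
    eps_uncond = model(x_t, t, uncond)  # prediction with a null prompt
    eps_cond = model(x_t, t, cond)      # prediction with the text prompt
    return eps_uncond + w * (eps_cond - eps_uncond)
```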