Customization Assistant for Text-to-image Generation
- URL: http://arxiv.org/abs/2312.03045v2
- Date: Wed, 8 May 2024 21:46:34 GMT
- Title: Customization Assistant for Text-to-image Generation
- Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Tong Sun,
- Abstract summary: We propose a new framework consists of a new model design and a novel training strategy.
The resulting assistant can perform customized generation in 2-5 seconds without any test time fine-tuning.
- Score: 40.76198867803018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Customizing pre-trained text-to-image generation model has attracted massive research interest recently, due to its huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in single user-input image, their capability are still far from perfection. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, while their performance are unsatisfactory. Furthermore, the interaction between users and models are still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on pre-trained large language model and diffusion model, which can not only perform customized generation in a tuning-free manner, but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instruction. Specifically, we propose a new framework consists of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test time fine-tuning. Extensive experiments are conducted, competitive results have been obtained across different domains, illustrating the effectiveness of the proposed method.
Related papers
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (jedi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z) - Training-free Editioning of Text-to-Image Models [47.32550822603952]
We propose a novel task, namely, training-free editioning, for text-to-image models.
We aim to create variations of a base text-to-image model without retraining.
Our proposed editioning paradigm enables a service provider to customize the base model into its "cat edition"
arXiv Detail & Related papers (2024-05-27T11:40:50Z) - Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - Generating Illustrated Instructions [41.613203340244155]
We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs.
We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion.
arXiv Detail & Related papers (2023-12-07T18:59:20Z) - Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image
Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z) - Enhancing Detail Preservation for Customized Text-to-Image Generation: A
Regularization-Free Approach [43.53330622723175]
We propose a novel framework for customized text-to-image generation without the use of regularization.
With the proposed framework, we are able to customize a large-scale text-to-image generation model within half a minute on single GPU.
arXiv Detail & Related papers (2023-05-23T01:14:53Z) - Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.