ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
- URL: http://arxiv.org/abs/2306.00971v2
- Date: Thu, 7 Dec 2023 17:49:30 GMT
- Title: ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
- Authors: Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong
- Abstract summary: We present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual condition into personalized text-to-image generation.
ViCo stands out for its unique feature of not requiring any fine-tuning of the original diffusion model parameters.
ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art models, both qualitatively and quantitatively.
- Score: 22.608957437064213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized text-to-image generation using diffusion models has recently
emerged and garnered significant interest. This task learns a novel concept
(e.g., a unique toy), illustrated in a handful of images, into a generative
model that captures fine visual details and generates photorealistic images
based on textual embeddings. In this paper, we present ViCo, a novel
lightweight plug-and-play method that seamlessly integrates visual condition
into personalized text-to-image generation. ViCo stands out for its unique
feature of not requiring any fine-tuning of the original diffusion model
parameters, thereby facilitating more flexible and scalable model deployment.
This key advantage distinguishes ViCo from most existing models that
necessitate partial or full diffusion fine-tuning. ViCo incorporates an image
attention module that conditions the diffusion process on patch-wise visual
semantics, and an attention-based object mask that comes at no extra cost from
the attention module. Despite only requiring light parameter training (~6%
compared to the diffusion U-Net), ViCo delivers performance that is on par
with, or even surpasses, all state-of-the-art models, both qualitatively and
quantitatively. This underscores the efficacy of ViCo, making it a highly
promising solution for personalized text-to-image generation without the need
for diffusion model fine-tuning. Code: https://github.com/haoosz/ViCo
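To make the mechanism described above concrete, here is a minimal PyTorch sketch of a ViCo-style image cross-attention adapter: queries come from frozen U-Net tokens, keys and values from patch embeddings of the reference image, and a rough object mask is read off the attention maps. This is an illustrative sketch under assumptions, not the authors' implementation (see the repository linked above for that); the module name, dimensions, and the mask heuristic are invented for the example.

```python
# Minimal sketch of a ViCo-style image cross-attention adapter (illustrative only).
# Assumptions: the frozen diffusion U-Net exposes intermediate tokens of shape
# (B, N, dim), and reference-image patch embeddings of shape (B, M, img_dim) come
# from a frozen vision encoder. Names, sizes, and the mask heuristic are hypothetical.
import torch
import torch.nn as nn


class ImageCrossAttention(nn.Module):
    """Injects patch-wise visual semantics of a reference image into U-Net tokens."""

    def __init__(self, dim: int, img_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)      # queries from U-Net tokens
        self.to_k = nn.Linear(img_dim, dim, bias=False)  # keys from image patches
        self.to_v = nn.Linear(img_dim, dim, bias=False)  # values from image patches
        self.to_out = nn.Linear(dim, dim)

    def forward(self, unet_tokens, image_patches):
        B, N, _ = unet_tokens.shape
        M = image_patches.shape[1]
        h = self.heads

        q = self.to_q(unet_tokens).view(B, N, h, -1).transpose(1, 2)    # (B, h, N, d)
        k = self.to_k(image_patches).view(B, M, h, -1).transpose(1, 2)  # (B, h, M, d)
        v = self.to_v(image_patches).view(B, M, h, -1).transpose(1, 2)  # (B, h, M, d)

        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)   # (B, h, N, M)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        # Rough "free" object mask: tokens that attend sharply to some reference
        # patch are treated as foreground (a heuristic stand-in; the paper derives
        # its attention-based mask differently).
        saliency = attn.mean(dim=1).amax(dim=-1)                        # (B, N)
        mask = (saliency > saliency.mean(dim=1, keepdim=True)).float()

        # Residual injection; the frozen U-Net pathway itself is never modified.
        return unet_tokens + self.to_out(out), mask


# Only the adapter would be trained; the diffusion U-Net stays frozen.
adapter = ImageCrossAttention(dim=320, img_dim=768)
tokens = torch.randn(2, 64 * 64, 320)      # placeholder U-Net features
patches = torch.randn(2, 256, 768)         # placeholder reference-image patches
fused, mask = adapter(tokens, patches)
print(fused.shape, mask.shape)             # torch.Size([2, 4096, 320]) torch.Size([2, 4096])
```

Counting only such adapter parameters against the frozen U-Net is how a figure like the quoted ~6% of trainable parameters would arise.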
Related papers
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control [8.685610154314459]
Diffusion models show extraordinary talent in text-to-image generation, but they may still fail to generate highly aesthetic images.
We propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
arXiv Detail & Related papers (2024-12-30T08:47:25Z)
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation [45.52926475981602]
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation.
VILA-U employs a single autoregressive next-token prediction framework for both tasks.
arXiv Detail & Related papers (2024-09-06T17:49:56Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On [35.4056826207203]
This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task.
The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module.
We show that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
arXiv Detail & Related papers (2023-05-22T21:38:06Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
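As a schematic of the "mixing weights of the two text embeddings" idea summarized above: one learnable weight per token interpolates between a neutral and a style-describing prompt embedding, optimized against a style term and a content-preservation term. The interfaces (`encode_text`, the two losses) are placeholders assumed for the example, not the paper's actual objective.

```python
# Schematic of optimizing per-token mixing weights between two text embeddings
# (e.g. "a photo of a dog" vs. "a watercolor painting of a dog"). The encoder and
# both loss terms are assumed placeholders, not the paper's actual objective.
import torch


def optimize_mixing_weights(encode_text, style_loss, content_loss,
                            neutral_prompt, styled_prompt,
                            steps=100, lr=0.05, alpha=0.5):
    e_neutral = encode_text(neutral_prompt)         # (T, D) token embeddings
    e_styled = encode_text(styled_prompt)           # (T, D)
    logits = torch.zeros(e_neutral.shape[0], 1, requires_grad=True)  # one weight per token
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.sigmoid(logits)                   # mixing weights in [0, 1]
        e_mix = (1 - w) * e_neutral + w * e_styled  # interpolated conditioning
        loss = style_loss(e_mix) + alpha * content_loss(e_mix, e_neutral)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach()


# Dummy stand-ins just to exercise the routine end to end.
enc = lambda prompt: torch.randn(8, 768)
weights = optimize_mixing_weights(
    enc,
    style_loss=lambda e: e.pow(2).mean(),
    content_loss=lambda e, ref: (e - ref).pow(2).mean(),
    neutral_prompt="a photo of a dog",
    styled_prompt="a watercolor painting of a dog",
    steps=5,
)
print(weights.shape)  # torch.Size([8, 1])
```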
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
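As a toy illustration of an "ensemble of expert denoisers" specialized by synthesis stage, the snippet below simply dispatches each denoising step to the expert registered for that timestep interval; the two-expert split and the boundary are invented for the example and are not eDiffi's actual configuration.

```python
# Toy timestep-based routing over an ensemble of denoisers (the boundary and the
# two-expert split are invented for illustration, not eDiffi's configuration).
from typing import Callable, Sequence, Tuple

import torch

# A denoiser is any callable (x_t, t, text_emb) -> predicted noise.
Denoiser = Callable[[torch.Tensor, int, torch.Tensor], torch.Tensor]


def route_expert(experts: Sequence[Tuple[range, Denoiser]], t: int) -> Denoiser:
    """Return the expert whose timestep interval contains t."""
    for interval, expert in experts:
        if t in interval:
            return expert
    raise ValueError(f"no expert registered for timestep {t}")


# Dummy experts: one for high-noise steps (global layout, text alignment),
# one for low-noise steps (fine visual detail).
high_noise_expert = lambda x, t, c: torch.zeros_like(x)
low_noise_expert = lambda x, t, c: torch.zeros_like(x)
experts = [(range(500, 1000), high_noise_expert), (range(0, 500), low_noise_expert)]

x = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 768)
for t in reversed(range(1000)):
    eps = route_expert(experts, t)(x, t, text_emb)  # exactly one expert call per step
    # ...the usual diffusion update of x from eps would go here...
```

Because each step still makes exactly one denoiser call, the per-step inference cost matches that of a single model, as the summary notes.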
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
We fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject.
The unique identifier can then be used to synthesize fully novel, photorealistic images of the subject contextualized in different scenes.
arXiv Detail & Related papers (2022-08-25T17:45:49Z)
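The "unique identifier" binding summarized above is usually realized by fine-tuning on prompts that pair a rare placeholder token with the subject's class name. The loop below is a heavily simplified sketch with assumed stand-ins for the latent-diffusion pieces, not the official DreamBooth training code; the real recipe also adds a class-prior preservation loss that is omitted here.

```python
# Heavily simplified DreamBooth-style fine-tuning step (illustrative only). The
# scheduler, "U-Net", and text embedding below are dummy stand-ins; the real
# recipe additionally uses a class-prior preservation loss, omitted here.
import torch
import torch.nn.functional as F

# The subject is referred to by pairing a rare identifier token with its class name.
IDENTIFIER_PROMPT = "a photo of sks dog"


def dreambooth_step(unet, text_emb, latents, noise_scheduler, optimizer):
    """One denoising-reconstruction step on a subject image."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.num_timesteps, (latents.shape[0],))
    noisy = noise_scheduler.add_noise(latents, noise, t)  # forward diffusion
    pred = unet(noisy, t, text_emb)                       # predict the added noise
    loss = F.mse_loss(pred, noise)                        # standard epsilon objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Minimal dummies just to exercise the step end to end.
class DummyScheduler:
    num_timesteps = 1000

    def add_noise(self, x, n, t):                         # toy forward process
        a = (1.0 - t.float() / self.num_timesteps).view(-1, 1, 1, 1)
        return a.sqrt() * x + (1.0 - a).sqrt() * n


backbone = torch.nn.Conv2d(4, 4, 3, padding=1)            # placeholder "U-Net"
unet = lambda x, t, c: backbone(x)                        # ignores t and conditioning
latents = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 768)   # would be text_encoder(tokenize(IDENTIFIER_PROMPT))
opt = torch.optim.AdamW(backbone.parameters(), lr=5e-6)
print(dreambooth_step(unet, text_emb, latents, DummyScheduler(), opt))
```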
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.