ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
- URL: http://arxiv.org/abs/2306.00971v2
- Date: Thu, 7 Dec 2023 17:49:30 GMT
- Title: ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
- Authors: Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong
- Abstract summary: We present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual condition into personalized text-to-image generation.
ViCo stands out for its unique feature of not requiring any fine-tuning of the original diffusion model parameters.
ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art models, both qualitatively and quantitatively.
- Score: 22.608957437064213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized text-to-image generation using diffusion models has recently
emerged and garnered significant interest. This task learns a novel concept
(e.g., a unique toy), illustrated in a handful of images, into a generative
model that captures fine visual details and generates photorealistic images
based on textual embeddings. In this paper, we present ViCo, a novel
lightweight plug-and-play method that seamlessly integrates visual condition
into personalized text-to-image generation. ViCo stands out for its unique
feature of not requiring any fine-tuning of the original diffusion model
parameters, thereby facilitating more flexible and scalable model deployment.
This key advantage distinguishes ViCo from most existing models that
necessitate partial or full diffusion fine-tuning. ViCo incorporates an image
attention module that conditions the diffusion process on patch-wise visual
semantics, and an attention-based object mask that comes at no extra cost from
the attention module. Despite only requiring light parameter training (~6%
compared to the diffusion U-Net), ViCo delivers performance that is on par
with, or even surpasses, all state-of-the-art models, both qualitatively and
quantitatively. This underscores the efficacy of ViCo, making it a highly
promising solution for personalized text-to-image generation without the need
for diffusion model fine-tuning. Code: https://github.com/haoosz/ViCo
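To make the mechanism described above concrete, here is a minimal PyTorch sketch of a ViCo-style image cross-attention adapter: queries come from frozen U-Net tokens, keys and values from patch embeddings of the reference image, and a rough object mask is read off the attention maps. This is an illustrative sketch under assumptions, not the authors' implementation (see the repository linked above for that); the module name, dimensions, and the mask heuristic are invented for the example.

```python
# Minimal sketch of a ViCo-style image cross-attention adapter (illustrative only).
# Assumptions: the frozen diffusion U-Net exposes intermediate tokens of shape
# (B, N, dim), and reference-image patch embeddings of shape (B, M, img_dim) come
# from a frozen vision encoder. Names, sizes, and the mask heuristic are hypothetical.
import torch
import torch.nn as nn


class ImageCrossAttention(nn.Module):
    """Injects patch-wise visual semantics of a reference image into U-Net tokens."""

    def __init__(self, dim: int, img_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)      # queries from U-Net tokens
        self.to_k = nn.Linear(img_dim, dim, bias=False)  # keys from image patches
        self.to_v = nn.Linear(img_dim, dim, bias=False)  # values from image patches
        self.to_out = nn.Linear(dim, dim)

    def forward(self, unet_tokens, image_patches):
        B, N, _ = unet_tokens.shape
        M = image_patches.shape[1]
        h = self.heads

        q = self.to_q(unet_tokens).view(B, N, h, -1).transpose(1, 2)    # (B, h, N, d)
        k = self.to_k(image_patches).view(B, M, h, -1).transpose(1, 2)  # (B, h, M, d)
        v = self.to_v(image_patches).view(B, M, h, -1).transpose(1, 2)  # (B, h, M, d)

        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)   # (B, h, N, M)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        # Rough "free" object mask: tokens that attend sharply to some reference
        # patch are treated as foreground (a heuristic stand-in; the paper derives
        # its attention-based mask differently).
        saliency = attn.mean(dim=1).amax(dim=-1)                        # (B, N)
        mask = (saliency > saliency.mean(dim=1, keepdim=True)).float()

        # Residual injection; the frozen U-Net pathway itself is never modified.
        return unet_tokens + self.to_out(out), mask


# Only the adapter would be trained; the diffusion U-Net stays frozen.
adapter = ImageCrossAttention(dim=320, img_dim=768)
tokens = torch.randn(2, 64 * 64, 320)      # placeholder U-Net features
patches = torch.randn(2, 256, 768)         # placeholder reference-image patches
fused, mask = adapter(tokens, patches)
print(fused.shape, mask.shape)             # torch.Size([2, 4096, 320]) torch.Size([2, 4096])
```

Counting only such adapter parameters against the frozen U-Net is how a figure like the quoted ~6% of trainable parameters would arise.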
Related papers
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control [8.685610154314459]
Diffusion models show extraordinary talent in text-to-image generation, but they may still fail to generate highly aesthetic images.
We propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
arXiv Detail & Related papers (2024-12-30T08:47:25Z)
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation [45.52926475981602]
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation.
VILA-U employs a single autoregressive next-token prediction framework for both tasks.
arXiv Detail & Related papers (2024-09-06T17:49:56Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On [35.4056826207203]
This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task.
The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module.
We show that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
arXiv Detail & Related papers (2023-05-22T21:38:06Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
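As a schematic of the "mixing weights of the two text embeddings" idea summarized above: one learnable weight per token interpolates between a neutral and a style-describing prompt embedding, optimized against a style term and a content-preservation term. The interfaces (`encode_text`, the two losses) are placeholders assumed for the example, not the paper's actual objective.

```python
# Schematic of optimizing per-token mixing weights between two text embeddings
# (e.g. "a photo of a dog" vs. "a watercolor painting of a dog"). The encoder and
# both loss terms are assumed placeholders, not the paper's actual objective.
import torch


def optimize_mixing_weights(encode_text, style_loss, content_loss,
                            neutral_prompt, styled_prompt,
                            steps=100, lr=0.05, alpha=0.5):
    e_neutral = encode_text(neutral_prompt)         # (T, D) token embeddings
    e_styled = encode_text(styled_prompt)           # (T, D)
    logits = torch.zeros(e_neutral.shape[0], 1, requires_grad=True)  # one weight per token
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.sigmoid(logits)                   # mixing weights in [0, 1]
        e_mix = (1 - w) * e_neutral + w * e_styled  # interpolated conditioning
        loss = style_loss(e_mix) + alpha * content_loss(e_mix, e_neutral)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach()


# Dummy stand-ins just to exercise the routine end to end.
enc = lambda prompt: torch.randn(8, 768)
weights = optimize_mixing_weights(
    enc,
    style_loss=lambda e: e.pow(2).mean(),
    content_loss=lambda e, ref: (e - ref).pow(2).mean(),
    neutral_prompt="a photo of a dog",
    styled_prompt="a watercolor painting of a dog",
    steps=5,
)
print(weights.shape)  # torch.Size([8, 1])
```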
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
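As a toy illustration of an "ensemble of expert denoisers" specialized by synthesis stage, the snippet below simply dispatches each denoising step to the expert registered for that timestep interval; the two-expert split and the boundary are invented for the example and are not eDiffi's actual configuration.

```python
# Toy timestep-based routing over an ensemble of denoisers (the boundary and the
# two-expert split are invented for illustration, not eDiffi's configuration).
from typing import Callable, Sequence, Tuple

import torch

# A denoiser is any callable (x_t, t, text_emb) -> predicted noise.
Denoiser = Callable[[torch.Tensor, int, torch.Tensor], torch.Tensor]


def route_expert(experts: Sequence[Tuple[range, Denoiser]], t: int) -> Denoiser:
    """Return the expert whose timestep interval contains t."""
    for interval, expert in experts:
        if t in interval:
            return expert
    raise ValueError(f"no expert registered for timestep {t}")


# Dummy experts: one for high-noise steps (global layout, text alignment),
# one for low-noise steps (fine visual detail).
high_noise_expert = lambda x, t, c: torch.zeros_like(x)
low_noise_expert = lambda x, t, c: torch.zeros_like(x)
experts = [(range(500, 1000), high_noise_expert), (range(0, 500), low_noise_expert)]

x = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 768)
for t in reversed(range(1000)):
    eps = route_expert(experts, t)(x, t, text_emb)  # exactly one expert call per step
    # ...the usual diffusion update of x from eps would go here...
```

Because each step still makes exactly one denoiser call, the per-step inference cost matches that of a single model, as the summary notes.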
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
We fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject.
The unique identifier can then be used to synthesize fully novel, photorealistic images of the subject contextualized in different scenes.
arXiv Detail & Related papers (2022-08-25T17:45:49Z)
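The "unique identifier" binding summarized above is usually realized by fine-tuning on prompts that pair a rare placeholder token with the subject's class name. The loop below is a heavily simplified sketch with assumed stand-ins for the latent-diffusion pieces, not the official DreamBooth training code; the real recipe also adds a class-prior preservation loss that is omitted here.

```python
# Heavily simplified DreamBooth-style fine-tuning step (illustrative only). The
# scheduler, "U-Net", and text embedding below are dummy stand-ins; the real
# recipe additionally uses a class-prior preservation loss, omitted here.
import torch
import torch.nn.functional as F

# The subject is referred to by pairing a rare identifier token with its class name.
IDENTIFIER_PROMPT = "a photo of sks dog"


def dreambooth_step(unet, text_emb, latents, noise_scheduler, optimizer):
    """One denoising-reconstruction step on a subject image."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.num_timesteps, (latents.shape[0],))
    noisy = noise_scheduler.add_noise(latents, noise, t)  # forward diffusion
    pred = unet(noisy, t, text_emb)                       # predict the added noise
    loss = F.mse_loss(pred, noise)                        # standard epsilon objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Minimal dummies just to exercise the step end to end.
class DummyScheduler:
    num_timesteps = 1000

    def add_noise(self, x, n, t):                         # toy forward process
        a = (1.0 - t.float() / self.num_timesteps).view(-1, 1, 1, 1)
        return a.sqrt() * x + (1.0 - a).sqrt() * n


backbone = torch.nn.Conv2d(4, 4, 3, padding=1)            # placeholder "U-Net"
unet = lambda x, t, c: backbone(x)                        # ignores t and conditioning
latents = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 768)   # would be text_encoder(tokenize(IDENTIFIER_PROMPT))
opt = torch.optim.AdamW(backbone.parameters(), lr=5e-6)
print(dreambooth_step(unet, text_emb, latents, DummyScheduler(), opt))
```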
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.