VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
- URL: http://arxiv.org/abs/2412.20800v1
- Date: Mon, 30 Dec 2024 08:47:25 GMT
- Title: VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
- Authors: Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He
- Abstract summary: Diffusion models show extraordinary talents in text-to-image generation, but they may still fail to generate highly aesthetic images.
We propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
- Score: 8.685610154314459
- Abstract: While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
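The abstract's core mechanism, value-mixed cross-attention joined to the base network by zero-initialized linear layers, can be sketched as follows. This is a minimal PyTorch illustration written from the abstract alone: the module and parameter names, the mean-pooling of the aesthetic embedding, and the exact additive mixing rule are assumptions made for clarity, not the authors' released implementation.

```python
# Minimal sketch of value-mixed cross-attention as described in the abstract.
# Assumptions (not the authors' code): module/parameter names, mean-pooling of
# the aesthetic embedding, and additive mixing on the value/output pathway.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueMixedCrossAttention(nn.Module):
    def __init__(self, dim: int, text_dim: int, aes_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)       # queries from image latents
        self.to_k = nn.Linear(text_dim, dim, bias=False)  # keys from the content prompt
        self.to_v = nn.Linear(text_dim, dim, bias=False)  # values from the content prompt
        # Extra value branch for the aesthetic description; the connecting layer
        # is zero-initialized so the adapter starts as a no-op and cannot disturb
        # the pretrained model (or its image-text alignment) at initialization.
        self.to_v_aes = nn.Linear(aes_dim, dim, bias=False)
        self.mix = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.mix.weight)

    def forward(self, x, text_emb, aes_emb):
        # x: (B, N, dim) latent tokens; text_emb: (B, L, text_dim) content tokens;
        # aes_emb: (B, L_a, aes_dim) aesthetic embedding.
        b, n, _ = x.shape
        q, k, v = self.to_q(x), self.to_k(text_emb), self.to_v(text_emb)

        def split_heads(t):
            return t.view(b, -1, self.heads, t.shape[-1] // self.heads).transpose(1, 2)

        # Standard image-text cross-attention: queries and keys are untouched,
        # which is what preserves the image-text alignment.
        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(b, n, -1)

        # Mix the aesthetic condition into the value/output pathway only.
        aes_value = self.to_v_aes(aes_emb).mean(dim=1, keepdim=True)  # (B, 1, dim)
        return out + self.mix(aes_value).expand(b, n, -1)
```

Because the mixing layer starts at zero, plugging such a block into a pretrained UNet initially reproduces the base model exactly, which is consistent with the plug-and-play claim; the aesthetic branch only takes effect as the zero-initialized layer is trained.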
Related papers
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
- ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model [73.95608242322949]
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images.
We present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion to address challenges such as misinterpreted styles and inconsistent semantics.
arXiv Detail & Related papers (2024-05-24T07:19:40Z)
- DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models [18.44432223381586]
Recently, a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks.
In these techniques, two or more randomly selected natural images are mixed together to generate an augmented image (a minimal sketch of this mixing step appears after the related-papers list).
We propose DiffuseMix, a novel data augmentation technique that leverages a diffusion model to reshape training images.
arXiv Detail & Related papers (2024-04-05T05:31:02Z)
- FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model [19.170302996189335]
Our FreePIH method tames the denoising process as a plug-in module for foreground image style transfer.
We make use of multi-scale features to enforce the consistency of the content and stability of the foreground objects in the latent space.
Our method can surpass representative baselines by large margins.
arXiv Detail & Related papers (2023-11-25T04:23:49Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- SMMix: Self-Motivated Image Mixing for Vision Transformers [65.809376136455]
CutMix is a vital augmentation strategy that determines the performance and generalization ability of vision transformers (ViTs).
Existing CutMix variants tackle the inconsistency between mixed images and their labels by generating more consistent mixed images or more precise mixed labels.
We propose an efficient and effective Self-Motivated image Mixing method (SMMix) which motivates both image and label enhancement by the model under training itself.
arXiv Detail & Related papers (2022-12-26T00:19:39Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
- MagicMix: Semantic Mixing with Diffusion Models [85.43291162563652]
We explore a new task called semantic mixing, aiming at blending two different semantics to create a new concept.
We present MagicMix, a solution based on pre-trained text-conditioned diffusion models.
Our method does not require any spatial mask or re-training, yet is able to synthesize novel objects with high fidelity.
arXiv Detail & Related papers (2022-10-28T11:07:48Z)
- Bridging the Visual Gap: Wide-Range Image Blending [16.464837892640812]
We introduce an effective deep-learning model to realize wide-range image blending.
We experimentally demonstrate that our proposed method is able to produce visually appealing results.
arXiv Detail & Related papers (2021-03-28T15:07:45Z)
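As referenced in the DiffuseMix entry above, image-mixing augmentation combines randomly selected training images into new samples. The sketch below shows a generic mixup-style variant of that idea for illustration only; it is not the DiffuseMix (or SMMix) algorithm, and the Beta-distributed mixing weight is a standard assumption borrowed from mixup.

```python
# Generic mixup-style image-mixing augmentation, shown as an example of the
# technique family summarized above (not DiffuseMix or SMMix itself).
import torch

def mixup_batch(images: torch.Tensor, labels: torch.Tensor, alpha: float = 0.4):
    """Blend each image (and its one-hot label) with a randomly chosen partner.

    images: (B, C, H, W) float tensor; labels: (B, num_classes) one-hot tensor.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing weight
    perm = torch.randperm(images.size(0))                         # random partners
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels
```

Label-preserving approaches such as DiffuseMix differ in how the mixed image is produced (per the entry above, a diffusion model reshapes the training image rather than blending pixels), but the augmentation interface is the same: original samples in, augmented samples out.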