DEADiff: An Efficient Stylization Diffusion Model with Disentangled
Representations
- URL: http://arxiv.org/abs/2403.06951v2
- Date: Tue, 12 Mar 2024 03:38:13 GMT
- Title: DEADiff: An Efficient Stylization Diffusion Model with Disentangled
Representations
- Authors: Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang
Chen, Qian He, Yongdong Zhang
- Abstract summary: Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using the following two strategies.
DEAiff attains the best visual stylization results and optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
- Score: 64.43387739794531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The diffusion-based text-to-image model harbors immense potential in
transferring reference style. However, current encoder-based approaches
significantly impair the text controllability of text-to-image models while
transferring styles. In this paper, we introduce DEADiff to address this issue
using the following two strategies: 1) a mechanism to decouple the style and
semantics of reference images. The decoupled feature representations are first
extracted by Q-Formers which are instructed by different text descriptions.
Then they are injected into mutually exclusive subsets of cross-attention
layers for better disentanglement. 2) A non-reconstructive learning method. The
Q-Formers are trained using paired images rather than the identical target, in
which the reference image and the ground-truth image are with the same style or
semantics. We show that DEADiff attains the best visual stylization results and
optimal balance between the text controllability inherent in the text-to-image
model and style similarity to the reference image, as demonstrated both
quantitatively and qualitatively. Our project page is
https://tianhao-qi.github.io/DEADiff/.
Related papers
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z) - Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - Reinforcement Learning from Diffusion Feedback: Q* for Image Search [2.5835347022640254]
We present two models for image generation using model-agnostic learning.
RLDF is a singular approach for visual imitation through prior-preserving reward function guidance.
It generates high-quality images over varied domains showcasing class-consistency and strong visual diversity.
arXiv Detail & Related papers (2023-11-27T09:20:12Z) - Self-supervised Cross-view Representation Reconstruction for Change
Captioning [113.08380679787247]
Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change.
We propose a self-supervised cross-view representation reconstruction network.
arXiv Detail & Related papers (2023-09-28T09:28:50Z) - DSI2I: Dense Style for Unpaired Image-to-Image Translation [70.93865212275412]
Unpaired exemplar-based image-to-image (UEI2I) translation aims to translate a source image to a target image domain with the style of a target image exemplar.
We propose to represent style as a dense feature map, allowing for a finer-grained transfer to the source image without requiring any external semantic information.
Our results show that the translations produced by our approach are more diverse, preserve the source content better, and are closer to the exemplars when compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-26T18:45:25Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.