Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models
- URL: http://arxiv.org/abs/2312.06712v2
- Date: Wed, 31 Jan 2024 18:44:22 GMT
- Title: Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models
- Authors: Zhipeng Bao and Yijun Li and Krishna Kumar Singh and Yu-Xiong Wang and Martial Hebert
- Abstract summary: This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
- Score: 58.46926334842161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent significant strides achieved by diffusion-based Text-to-Image
(T2I) models, current systems are still less capable of ensuring decent
compositional generation aligned with text prompts, particularly for
multi-object generation. This work illuminates the fundamental reasons for such
misalignment, pinpointing issues related to low attention activation scores and
mask overlaps. While previous research efforts have individually tackled these
issues, we assert that a holistic approach is paramount. Thus, we propose two
novel objectives, the Separate loss and the Enhance loss, that reduce object
mask overlaps and maximize attention scores, respectively. Our method diverges
from conventional test-time-adaptation techniques, focusing on finetuning
critical parameters, which enhances scalability and generalizability.
Comprehensive evaluations demonstrate the superior performance of our model in
terms of image realism, text-image alignment, and adaptability, notably
outperforming prominent baselines. Ultimately, this research paves the way for
T2I diffusion models with enhanced compositional capacities and broader
applicability.
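To make the two objectives concrete, the following is a minimal PyTorch sketch (not the authors' released code) of how a Separate-style and an Enhance-style loss could be computed. It assumes the cross-attention maps for each object token are already extracted as a (num_objects, H, W) tensor; the function names, the min-based overlap measure, and the peak-based enhancement term are illustrative assumptions rather than the paper's exact formulation.

import torch

def separate_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    # attn_maps: (num_objects, H, W), one cross-attention map per object token.
    # Normalize each map into a spatial distribution, then penalize the
    # probability mass shared by every pair of maps, i.e. their overlap.
    maps = attn_maps.flatten(1)
    maps = maps / (maps.sum(dim=1, keepdim=True) + 1e-8)
    n = maps.shape[0]
    overlap = maps.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            overlap = overlap + torch.minimum(maps[i], maps[j]).sum()
    return overlap / max(n * (n - 1) // 2, 1)

def enhance_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    # Push each object token's peak activation toward 1 so that every
    # object receives a strong response somewhere in the image.
    peaks = attn_maps.flatten(1).max(dim=1).values
    return (1.0 - peaks).mean()

# Hypothetical usage: the toy tensor stands in for real attention maps.
attn = torch.rand(3, 16, 16, requires_grad=True)  # e.g. three object tokens
loss = separate_loss(attn) + enhance_loss(attn)
loss.backward()

In the actual method the gradients would flow through the attention maps into the subset of model weights being finetuned, in line with the paper's emphasis on finetuning critical parameters rather than adapting latents at test time.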
Related papers
- DaLPSR: Leverage Degradation-Aligned Language Prompt for Real-World Image Super-Resolution [19.33582308829547]
This paper proposes to leverage degradation-aligned language prompts for accurate, fine-grained, and high-fidelity image restoration.
The proposed method achieves a new state-of-the-art perceptual quality level.
arXiv Detail & Related papers (2024-06-24T09:30:36Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving textual control.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-free Multi-Exposure Image Fusion [60.221404321514086]
Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels.
This paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for automatic design of both network structures and loss functions.
arXiv Detail & Related papers (2023-09-03T08:07:26Z)
- Grounded Text-to-Image Synthesis with Attention Refocusing [16.9170825951175]
We reveal potential causes of unfaithful generation in the diffusion model's cross-attention and self-attention layers.
We propose two novel losses to refocus attention maps according to a given spatial layout during sampling.
We show that our proposed attention refocusing effectively improves the controllability of existing approaches (see the sketch after this list).
arXiv Detail & Related papers (2023-06-08T17:59:59Z)
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis [78.28620571530706]
Large-scale diffusion models have achieved state-of-the-art results on text-to-image (T2I) synthesis tasks.
We improve the compositional skills of T2I models, specifically achieving more accurate attribute binding and better image compositions.
arXiv Detail & Related papers (2022-12-09T18:30:24Z)
- Robust Single Image Dehazing Based on Consistent and Contrast-Assisted Reconstruction [95.5735805072852]
We propose a novel density-variational learning framework to improve the robustness of the image dehazing model.
Specifically, the dehazing network is optimized under the consistency-regularized framework.
Our method significantly surpasses the state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T08:11:04Z)
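For comparison with the Separate and Enhance objectives above, the attention-refocusing idea from the Grounded Text-to-Image Synthesis entry can be sketched in the same style. This is a hedged illustration assuming one cross-attention map per token and a binary layout mask; the loss form and names are assumptions, not that paper's implementation.

import torch

def refocus_loss(attn_map: torch.Tensor, layout_mask: torch.Tensor) -> torch.Tensor:
    # attn_map: (H, W) cross-attention map for one text token.
    # layout_mask: (H, W) binary mask marking where the token should attend.
    p = attn_map / (attn_map.sum() + 1e-8)
    inside = (p * layout_mask).sum()
    # Minimizing (1 - inside) pushes attention mass into the layout region.
    return 1.0 - inside

Applied during sampling, such a loss steers each token's attention toward a user-given layout, whereas the Separate loss above only keeps different objects' attention apart.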