Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2306.14408v3
- Date: Sun, 14 Jul 2024 21:51:28 GMT
- Title: Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
- Authors: Luozhou Wang, Guibao Shen, Wenhang Ge, Guangyong Chen, Yijun Li, Ying-cong Chen
- Abstract summary: We present a training-free approach called Text-Anchored Score Composition (TASC) to improve the controllability of existing models.
We propose an attention realignment operation that realigns independently calculated results via a cross-attention mechanism, avoiding new conflicts when combining them back.
- Score: 35.02969643344228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models have advanced towards more controllable generation by supporting various additional conditions (e.g., depth map, bounding box) beyond text. However, these models are learned based on the premise of perfect alignment between the text and the extra conditions. If this alignment is not satisfied, the final output may be dominated by one condition, or ambiguity may arise, failing to meet user expectations. To address this issue, we present a training-free approach called Text-Anchored Score Composition (TASC) to further improve the controllability of existing models when provided with partially aligned conditions. TASC first separates conditions based on pair relationships, computing the result individually for each pair. This ensures that no pair contains conflicting conditions. We then propose an attention realignment operation that realigns these independently calculated results via a cross-attention mechanism, avoiding new conflicts when combining them back. Both qualitative and quantitative results demonstrate the effectiveness of our approach in handling unaligned conditions; it performs favorably against recent methods and, more importantly, adds flexibility to the controllable image generation process. Our code will be available at: https://github.com/EnVision-Research/Decompose-and-Realign.
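The per-pair decomposition described in the abstract can be illustrated with a minimal numpy sketch of guidance-style score composition: each condition pair contributes a delta relative to the unconditional noise prediction, and the deltas are summed. This is an assumption-laden simplification; `compose_scores` and the weights are illustrative names, and the paper's actual method additionally realigns the per-pair results with cross-attention before merging.

```python
import numpy as np

def compose_scores(eps_uncond, eps_pairs, weights):
    """Compose per-pair noise predictions relative to the unconditional
    prediction (a simplified, guidance-style stand-in for score
    composition; TASC itself also applies attention realignment)."""
    eps = eps_uncond.copy()
    for eps_c, w in zip(eps_pairs, weights):
        # each condition pair contributes a weighted score delta
        eps += w * (eps_c - eps_uncond)
    return eps

# toy example: two condition pairs on a 4x4 "latent"
rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 4))
pairs = [rng.standard_normal((4, 4)) for _ in range(2)]
out = compose_scores(eps_u, pairs, weights=[1.5, 1.5])
```

With all weights set to zero the composition reduces to the unconditional prediction, which is a quick sanity check on the formulation.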
Related papers
- Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis [43.481539150288434]
This work introduces a new family of factor graph Diffusion Models (FG-DMs).
FG-DMs model the joint distribution of images and conditioning variables, such as semantic, sketch, depth, or normal maps, via a factor graph decomposition.
arXiv Detail & Related papers (2024-10-29T00:54:00Z) - Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval [66.61856014573742]
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description.
Previous methods have attempted to align text and image samples in a modal-shared space.
We propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample.
arXiv Detail & Related papers (2024-06-09T03:06:55Z) - ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
First, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information.
Second, to overcome the issue of limited conditional supervision, we introduce a Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
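The consistency idea above can be sketched as a simple penalty on the distance between (a projection of) the timestep-t latent and the input condition signal. This is a hypothetical sketch, not the paper's implementation: `diffusion_consistency_loss` and `extract` are illustrative names, with `extract` standing in for whatever mapping the method uses from latent space to condition space.

```python
import numpy as np

def diffusion_consistency_loss(latent, condition, extract=lambda z: z):
    """Hypothetical consistency loss: mean squared distance between a
    projection of the timestep-t latent and the input condition signal."""
    diff = extract(latent) - condition
    return float(np.mean(diff ** 2))
```

When the projected latent matches the condition exactly, the loss is zero, which is the behavior such a consistency term is meant to encourage.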
arXiv Detail & Related papers (2024-03-27T10:09:38Z) - PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis [62.29033292210752]
Generating high-quality images with consistent semantics and layout remains a challenge.
We propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues.
Our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment.
arXiv Detail & Related papers (2024-03-04T09:03:16Z) - Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z) - Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z) - High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z) - Refign: Align and Refine for Adaptation of Semantic Segmentation to Adverse Conditions [78.71745819446176]
Refign is a generic extension to self-training-based UDA methods which leverages cross-domain correspondences.
Refign consists of two steps: (1) aligning the normal-condition image to the corresponding adverse-condition image using an uncertainty-aware dense matching network, and (2) refining the adverse prediction with the normal prediction using an adaptive label correction mechanism.
The approach introduces no extra training parameters and only minimal computational overhead (during training only), and can be used as a drop-in extension to improve any given self-training-based UDA method.
arXiv Detail & Related papers (2022-07-14T11:30:38Z) - Macroscopic Control of Text Generation for Image Captioning [4.742874328556818]
Two novel methods are introduced, one for each problem.
For the former, we introduce a control signal that governs macroscopic sentence attributes such as sentence quality, sentence length, sentence tense, and number of nouns.
For the latter, we propose a strategy in which an image-text matching model is trained to measure the quality of sentences generated in both the forward and backward directions, and the better one is chosen.
arXiv Detail & Related papers (2021-01-20T07:20:07Z) - Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport [14.86310501896212]
In this work, we extend this selective rationalization approach to text matching.
The goal is to jointly select and align text pieces, such as tokens or sentences, as a justification for the downstream prediction.
Our approach employs optimal transport (OT) to find a minimal cost alignment between the inputs.
arXiv Detail & Related papers (2020-05-27T01:20:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.