Related papers: Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

URL: http://arxiv.org/abs/2404.05384v1
Date: Mon, 8 Apr 2024 10:45:29 GMT
Title: Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
Authors: Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu
Abstract summary: We present a novel approach to customize the guidance degrees for different semantic units in text-to-image diffusion models. We adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models.
Score: 17.29693696084235
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-Net backbone is renormalized to assign each patch to its corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees to a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.
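
Illustrative sketch (not the authors' implementation): the abstract contrasts one global CFG scale with scales adjusted per semantic region. The NumPy code below shows the standard CFG combination and one plausible reading of "rescale the text guidance degrees into a uniform level", namely choosing each region's scale so that the mean guidance norm is the same in every region. The function names, the region map `region_ids`, and the rescaling rule are all assumptions made for illustration; the abstract does not give the paper's formula, and the attention-based segmentation step is not reproduced here.

```python
import numpy as np


def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance with a single global scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def region_rescaled_cfg(eps_uncond, eps_cond, region_ids, base_scale, eps=1e-8):
    """Hypothetical region-wise variant (an assumption, not the paper's rule).

    eps_uncond, eps_cond: noise predictions of shape (C, H, W).
    region_ids: integer map of shape (H, W) assigning each patch to a region.
    Each region's scale is set so that its mean guidance norm matches the
    mean guidance norm over the whole latent.
    """
    delta = eps_cond - eps_uncond             # text guidance direction
    norm = np.linalg.norm(delta, axis=0)      # per-patch guidance strength
    target = norm.mean()                      # uniform level to match
    scale_map = np.full(norm.shape, base_scale, dtype=delta.dtype)
    for r in np.unique(region_ids):
        mask = region_ids == r
        scale_map[mask] = base_scale * target / (norm[mask].mean() + eps)
    # Broadcast the (H, W) scale map over the channel axis.
    return eps_uncond + scale_map[None] * delta, scale_map
```

With a region map that has a single region, the scale map reduces to `base_scale` everywhere, so the sketch falls back to ordinary CFG.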

Related papers

Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set)
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
VOILA: Complexity-Aware Universal Segmentation of CT images by Voxel Interacting with Language [3.562621045863125]
We propose the VOxel Interacting with LAnguage method (VOILA) for universal CT image segmentation. We align voxels and language into a shared representation space and classify voxels on the basis of cosine similarity. We develop the Voxel-Language Interaction framework to mitigate the impact of class imbalance caused by foreground-background discrepancies and variations in target volumes.
arXiv Detail & Related papers (2025-01-07T03:00:58Z)
HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior [62.04939047885834]
We present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed spatial-CLIP Map.
arXiv Detail & Related papers (2024-11-27T15:22:44Z)
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation. We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. In particular, our approach allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation. FreeDA achieves state-of-the-art performance on five datasets.
arXiv Detail & Related papers (2024-04-09T18:00:25Z)
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation [35.44771460784343]
Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). Existing methods still struggle to preserve semantically consistent local details between the original and translated images. We present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation.
arXiv Detail & Related papers (2023-08-23T18:01:01Z)
Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain. With only a single representative text feature instead of real images, the synthesized images gradually lose diversity. We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
Unsupervised Domain Adaptation for Semantic Segmentation using One-shot Image-to-Image Translation via Latent Representation Mixing [9.118706387430883]
We propose a new unsupervised domain adaptation method for the semantic segmentation of very high resolution images. An image-to-image translation paradigm is proposed, based on an encoder-decoder principle where latent content representations are mixed across domains. Cross-city comparative experiments have shown that the proposed method outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2022-12-07T18:16:17Z)
Federated Domain Generalization for Image Recognition via Cross-Client Style Transfer [60.70102634957392]
Domain generalization (DG) has been a hot topic in image recognition, with a goal to train a general model that can perform well on unseen domains. In this paper, we propose a novel domain generalization method for image recognition through cross-client style transfer (CCST) without exchanging data samples. Our method outperforms recent SOTA DG methods on two DG benchmarks (PACS, OfficeHome) and a large-scale medical image dataset (Camelyon17) in the FL setting.
arXiv Detail & Related papers (2022-10-03T13:15:55Z)
Diffusion-based Image Translation using Disentangled Style and Content Representation [51.188396199083336]
Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer. It is often difficult to maintain the original content of the image during the reverse diffusion. We present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
arXiv Detail & Related papers (2022-09-30T06:44:37Z)
Language-aware Domain Generalization Network for Cross-Scene Hyperspectral Image Classification [15.842081807249416]
It is necessary to explore the effectiveness of linguistic mode in assisting hyperspectral image classification. Large-scale pre-training image-text foundation models have demonstrated great performance in a variety of downstream applications. A Language-aware Domain Generalization Network (LDGnet) is proposed to learn cross-domain invariant representation.
arXiv Detail & Related papers (2022-09-06T10:06:10Z)
Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance. We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations. AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as the mainstream method.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains. Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.