Detector Guidance for Multi-Object Text-to-Image Generation
- URL: http://arxiv.org/abs/2306.02236v1
- Date: Sun, 4 Jun 2023 02:33:12 GMT
- Title: Detector Guidance for Multi-Object Text-to-Image Generation
- Authors: Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin and Zhou Zhao
- Abstract summary: Detector Guidance (DG) integrates a latent object detection model to separate different objects during the generation process.
Human evaluations demonstrate that DG provides an 8-22% advantage in preventing the amalgamation of conflicting concepts.
- Score: 61.70018793720616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have demonstrated impressive performance in text-to-image
generation. They utilize a text encoder and cross-attention blocks to infuse
textual information into images at a pixel level. However, their ability to
generate images from text prompts containing multiple objects remains limited.
Previous works identify the problem of information mixing in the CLIP text
encoder and introduce the T5 text encoder or incorporate strong prior knowledge
to assist with the alignment. We find that mixing problems also occur on the
image side and in the cross-attention blocks. The noisy images can cause
different objects to appear similar, and the cross-attention blocks inject
information at a pixel level, leading to leakage of global object understanding
and resulting in object mixing. In this paper, we introduce Detector Guidance
(DG), which integrates a latent object detection model to separate different
objects during the generation process. DG first performs latent object
detection on cross-attention maps (CAMs) to obtain object information. Based on
this information, DG then masks conflicting prompts and enhances related
prompts by manipulating the following CAMs. We evaluate the effectiveness of DG
using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark,
MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in
preventing the amalgamation of conflicting concepts and in ensuring that each
object occupies its own distinct region, without any human involvement or
additional iterations. Our implementation is available at
https://github.com/luping-liu/Detector-Guidance.
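The abstract above outlines a three-step procedure: detect objects on cross-attention maps (CAMs), mask prompt tokens that conflict with a detected region, and enhance the matched tokens in the following CAMs. The Python sketch below illustrates only the core masking-and-boosting idea on toy tensors; every function and variable name here is hypothetical and is not taken from the official implementation linked above, which should be consulted for the actual algorithm.

import torch

def detect_objects_on_cams(cams, threshold=0.5):
    # Toy stand-in for DG's latent object detection: threshold each
    # object token's cross-attention map (CAM) into a binary region.
    # cams: (num_tokens, H, W), normalized over tokens at each pixel.
    peak = cams.amax(dim=(1, 2), keepdim=True)
    return cams > threshold * peak  # one boolean mask per token

def apply_detector_guidance(cams, masks, boost=2.0):
    # Suppress each token's attention outside its detected region,
    # amplify it inside, then renormalize over tokens at every pixel.
    guided = torch.where(masks, cams * boost, torch.zeros_like(cams))
    return guided / (guided.sum(dim=0, keepdim=True) + 1e-8)

# Usage on random stand-in maps: 2 object tokens, 16x16 latent grid.
cams = torch.rand(2, 16, 16).softmax(dim=0)
masks = detect_objects_on_cams(cams)
guided = apply_detector_guidance(cams, masks)
print(guided.shape)  # torch.Size([2, 16, 16])

After the guidance step, each pixel's attention mass is concentrated on the token whose detected region covers it, which is the object-separation effect DG aims for; in the real method the detection operates on the model's actual CAMs during sampling rather than on random tensors.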
Related papers
- Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection.
The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.
Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
- PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network [24.54269823691119]
We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives.
To compensate for the lack of HRSOD datasets, we collect a large-scale high-resolution salient object detection dataset called UHRSD.
All images are finely annotated at the pixel level, far exceeding previous low-resolution SOD datasets.
arXiv Detail & Related papers (2024-08-02T09:31:21Z)
- ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer [31.46969412692045]
This paper pioneers the direct detection of land-cover changes utilizing paired OSM data and optical imagery.
We propose an object-guided Transformer (ObjFormer) that naturally combines the object-based image analysis (OBIA) technique with the advanced vision Transformer architecture.
A large-scale benchmark dataset called OpenMapCD is constructed to conduct detailed experiments.
arXiv Detail & Related papers (2023-10-04T09:26:44Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Referring Camouflaged Object Detection [97.90911862979355]
Ref-COD aims to segment specified camouflaged objects based on a small set of referring images with salient target objects.
We first assemble a large-scale dataset, called R2C7K, which consists of 7K images covering 64 object categories in real-world scenarios.
arXiv Detail & Related papers (2023-06-13T04:15:37Z)
- Image Segmentation-based Unsupervised Multiple Objects Discovery [1.7674345486888503]
Unsupervised object discovery aims to localize objects in images.
We propose a fully unsupervised, bottom-up approach, for multiple objects discovery.
We provide state-of-the-art results for both unsupervised class-agnostic object detection and unsupervised image segmentation.
arXiv Detail & Related papers (2022-12-20T09:48:24Z)
- MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection [10.04773536815808]
We propose a behavior-inspired framework, called Multi-view Feature Fusion Network (MFFN), which mimics the human behaviors of finding indistinct objects in images.
MFFN captures critical edge and semantic information by comparing and fusing extracted multi-view features.
Our method performs favorably against existing state-of-the-art methods when trained with the same data.
arXiv Detail & Related papers (2022-10-12T16:12:58Z)
- Context-Aware Transfer Attacks for Object Detection [51.65308857232767]
We present a new approach to generate context-aware attacks for object detectors.
We show that by using co-occurrence of objects and their relative locations and sizes as context information, we can successfully generate targeted mis-categorization attacks.
arXiv Detail & Related papers (2021-12-06T18:26:39Z)
- Saliency Detection via Global Context Enhanced Feature Fusion and Edge Weighted Loss [6.112591965159383]
We propose a context fusion decoder network (CFDN) and near edge weighted loss (NEWLoss) function.
The CFDN produces an accurate saliency map by integrating global context information, thereby suppressing the influence of unnecessary spatial information.
NEWLoss accelerates the learning of obscure boundaries, without additional modules, by generating weight maps on object boundaries.
arXiv Detail & Related papers (2021-10-13T08:04:55Z)
- Data Augmentation for Object Detection via Differentiable Neural Rendering [71.00447761415388]
It is challenging to train a robust object detector when annotated data is scarce.
Existing approaches to this problem include semi-supervised learning, which interpolates labeled data from unlabeled data.
We introduce an offline data augmentation method for object detection, which semantically interpolates the training data with novel views.
arXiv Detail & Related papers (2021-03-04T06:31:06Z)