Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention
- URL: http://arxiv.org/abs/2411.19261v2
- Date: Sun, 09 Mar 2025 13:39:55 GMT
- Title: Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention
- Authors: Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, Huan Yang
- Abstract summary: We propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating subject convergence. Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images.
- Score: 25.69140705877128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training-free diffusion models have achieved remarkable progress in generating multi-subject consistent images within open-domain scenarios. The key idea of these methods is to incorporate reference subject information within the attention layer. However, existing methods still obtain suboptimal performance when handling numerous subjects. This paper reveals two primary issues contributing to this deficiency. Firstly, the undesired internal attraction between different subjects within the target image can lead to the convergence of multiple subjects into a single entity. Secondly, tokens tend to reference nearby tokens, which reduces the effectiveness of the attention mechanism when there is a significant positional difference between subjects in reference and target images. To address these issues, we propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Specifically, Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating the subject convergence. On the other hand, Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images. This ensures that subjects in the target image can better reference those in the reference image, thereby maintaining better consistency. Extensive experiments demonstrate that IR-Diffusion significantly enhances multi-subject consistency, outperforming all existing methods in open-domain scenarios.
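The abstract describes Isolation Attention as preventing target-image subjects from attending to one another while still attending to reference tokens. A minimal NumPy sketch of that masking idea is below; the function name, subject-label convention, and additive-mask formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def isolation_mask(subject_ids, n_ref):
    """Additive attention mask (0 = allowed, -inf = blocked).

    subject_ids : per-target-token subject label (-1 = background).
    n_ref       : number of reference tokens, prepended to the keys.

    A target query may attend to every reference token and to target
    tokens of its own subject (or background), but is blocked from
    tokens of any other subject -- the "isolation" constraint that
    prevents multiple subjects from converging into one entity.
    """
    ids = np.asarray(subject_ids)
    n_tgt = len(ids)
    mask = np.zeros((n_tgt, n_ref + n_tgt))
    for q in range(n_tgt):
        if ids[q] < 0:  # background queries stay unrestricted
            continue
        blocked = (ids >= 0) & (ids != ids[q])
        mask[q, n_ref:][blocked] = -np.inf  # reference keys stay open
    return mask
```

Such a mask would be added to attention logits before the softmax, so blocked positions receive zero attention weight while reference tokens remain fully visible.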
Related papers
- FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement [9.25127189426397]
Current methods often overfit when trained on a small number of samples.
FaR is a novel approach that tackles these challenges through two key contributions: Concept Fusion technique and Localized Refinement loss function.
Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2025-04-04T09:17:57Z) - Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image.
We frame the task as a spatially-conditioned inpainting problem, where the target image is in-painted to maintain appearance consistency with the reference.
This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
arXiv Detail & Related papers (2024-12-19T05:02:30Z) - CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis [8.386261591495103]
We introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps.
Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented.
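The two losses above can be caricatured with NumPy: a contrast term that penalizes a segment's overlap with other subjects' cross-attention maps, and a complete term that rewards how much of each subject's activation falls inside its own segment. This is a hedged sketch of the stated idea; the exact normalization and loss forms in CoCoNO may differ.

```python
import numpy as np

def contrast_and_complete_losses(seg_maps, cross_maps, eps=1e-8):
    """seg_maps   : (K, H, W) soft self-attention segments, one per subject.
    cross_maps : (K, H, W) cross-attention map for each subject token.

    contrast  -- fraction of a segment's cross-attention mass that
                 belongs to OTHER subjects (lower = less overlap).
    complete  -- negative fraction of each subject's activation that
                 lands inside its own segment (lower = more complete).
    """
    K = seg_maps.shape[0]
    contrast, complete = 0.0, 0.0
    for k in range(K):
        own = (seg_maps[k] * cross_maps[k]).sum()
        others = sum((seg_maps[k] * cross_maps[j]).sum()
                     for j in range(K) if j != k)
        contrast += others / (own + others + eps)
        complete += -own / (cross_maps[k].sum() + eps)
    return contrast / K, complete / K
```

Minimizing both terms over the initial latent pushes each self-attention segment toward exactly one subject while keeping that subject fully activated.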
arXiv Detail & Related papers (2024-11-25T08:20:14Z) - Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! [3.355491272942994]
This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics.
We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing.
arXiv Detail & Related papers (2024-10-28T12:43:48Z) - Towards Cross-View-Consistent Self-Supervised Surround Depth Estimation [9.569646683579899]
Self-Supervised Surround Depth Estimation (SSSDE) from consecutive images offers an economical alternative.
Previous SSSDE methods have proposed different mechanisms to fuse information across images, but few of them explicitly consider the cross-view constraints.
This paper proposes an efficient and consistent pose estimation design and two loss functions to enhance cross-view consistency for SSSDE.
arXiv Detail & Related papers (2024-07-04T16:29:05Z) - MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z) - Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z) - Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks.
We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens.
We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z) - HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping [29.678756772610797]
Unsupervised object discovery and localization aims to detect or segment objects in an image without any supervision.
Recent efforts have demonstrated a notable potential to identify salient foreground objects by utilizing self-supervised transformer features.
To address these problems, we introduce the Hierarchical mErging framework via contrAstive grouPing (HEAP).
arXiv Detail & Related papers (2023-12-29T06:46:37Z) - Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z) - Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z) - Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and fuse them in the common space either by iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, and then unrolls to a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
arXiv Detail & Related papers (2022-03-30T11:44:56Z) - Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [62.26677215668959]
We propose a lightweight, weakly supervised deep network to coarsely locate semantically salient regions.
We then fuse multiple off-the-shelf deep models on these semantically salient regions as the pixel-wise saliency refinement.
Our method is simple yet effective, and is the first attempt to treat salient object detection mainly as an object-level semantic re-ranking problem.
arXiv Detail & Related papers (2020-08-10T07:12:43Z) - Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.