Related papers: PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

URL: http://arxiv.org/abs/2403.05053v3
Date: Tue, 20 Aug 2024 05:14:00 GMT
Title: PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering
Authors: Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin
Abstract summary: We formulate image composition as a subject-based local editing task, solely focusing on foreground generation. We propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.
Score: 13.785484396436367
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image composition involves seamlessly integrating given objects into a specific visual context. Current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only impedes their swift implementation but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related tokens to desired regions, addressing the unwanted artifacts shown in the prior method, thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.
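Illustrative sketch (not from the paper): the abstract names two mechanisms, combining the edited foreground with the noisy background at each step and confining subject-related tokens to a desired region, but gives no formulas for either. The pure-Python snippet below shows one common way such steps are often written, assuming a per-element mask blend and a pre-softmax mask on the cross-attention scores. The function names `blend_step` and `region_constrained_attention`, the list-based data layout, and the masking rule are all hypothetical and should not be read as the authors' implementation.

```python
import math


def blend_step(foreground, background, mask):
    """Assumed blend rule: keep the edited foreground where mask is 1
    and the noisy background where mask is 0 (element-wise)."""
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


def region_constrained_attention(logits, region, subject_tokens):
    """Assumed masking rule for cross-attention.

    logits[p][t] is the score of pixel p for text token t.
    region[p] is truthy when pixel p lies inside the subject region.
    Scores of subject tokens are set to -inf for pixels outside the
    region, then a softmax over tokens is taken for each pixel.
    Assumes at least one non-subject token, so no row is fully masked.
    """
    weights = []
    for p, row in enumerate(logits):
        masked = [
            -math.inf if (t in subject_tokens and not region[p]) else s
            for t, s in enumerate(row)
        ]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy usage: a 2x2 "latent" and two pixels attending to two tokens.
blended = blend_step(
    foreground=[[1.0, 1.0], [1.0, 1.0]],
    background=[[0.0, 0.0], [0.0, 0.0]],
    mask=[[1.0, 0.0], [0.0, 0.0]],
)
attn = region_constrained_attention(
    logits=[[0.0, 0.0], [0.0, 0.0]],
    region=[1, 0],          # only pixel 0 is inside the subject region
    subject_tokens={1},     # token 1 is the subject-related token
)
```

In this toy run, only the masked-in element of the blend takes the foreground value, and the pixel outside the region assigns zero attention weight to the subject token. A real diffusion pipeline would apply the same ideas to latent tensors and per-layer attention maps.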

Related papers

Multi-view Image Diffusion via Coordinate Noise and Fourier Attention [5.251293630298169]
We propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism and cross-attention loss. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.
arXiv Detail & Related papers (2024-12-04T22:49:40Z)
Enhancing Conditional Image Generation with Explainable Latent Space Manipulation [0.0]
This paper proposes a novel approach to achieve fidelity to a reference image while adhering to conditional prompts. We analyze the cross attention maps of the cross attention layers and gradients for the denoised latent vector. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features.
arXiv Detail & Related papers (2024-08-29T03:12:04Z)
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models. We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase. We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships. Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task. We first apply attention masking in each denoising step to make the generation more disentangled across different objects. In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation [22.949365270116335]
We propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation.
arXiv Detail & Related papers (2024-05-11T08:11:25Z)
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
Take a Prior from Other Tasks for Severe Blur Removal [52.380201909782684]
Cross-level feature learning strategy based on knowledge distillation to learn the priors. Semantic prior embedding layer with multi-level aggregation and semantic attention transformation to integrate the priors effectively. Experiments on natural image deblurring benchmarks and real-world images, such as GoPro and RealBlur datasets, demonstrate our method's effectiveness and ability.
arXiv Detail & Related papers (2023-02-14T08:30:51Z)
Deep Image Compositing [93.75358242750752]
We propose a new method which can automatically generate high-quality image composites without any user input. Inspired by Laplacian pyramid blending, a dense-connected multi-stream fusion network is proposed to effectively fuse the information from the foreground and background images. Experiments show that the proposed method can automatically generate high-quality composites and outperforms existing methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-11-04T06:12:24Z)
Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284]
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts. The context is specified by the bounding box object layout which lacks shape information, while pose of the person(s) by keypoints which are sparsely annotated. To handle the stark difference in input structures, we proposed two separate neural branches to attentively composite the respective (context/person) inputs into shared "compositional structural space". This structural space is then decoded to the image space using multi-level feature modulation strategy, and learned in a self
arXiv Detail & Related papers (2020-08-28T14:33:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.