PrimeComposer: Faster Progressively Combined Diffusion for Image
Composition with Attention Steering
- URL: http://arxiv.org/abs/2403.05053v1
- Date: Fri, 8 Mar 2024 04:58:49 GMT
- Authors: Yibin Wang and Weizhong Zhang and Jianwei Zheng and Cheng Jin
- Abstract summary: We formulate image composition as a subject-based local editing task, solely focusing on foreground generation.
We propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels.
Our method exhibits the fastest inference efficiency, and extensive experiments demonstrate its superiority both qualitatively and quantitatively.
- Score: 15.059651360660073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image composition involves seamlessly integrating given objects into a
specific visual context. The current training-free methods rely on composing
attention weights from several samplers to guide the generator. However, since
these weights are derived from disparate contexts, their combination leads to
coherence confusion in synthesis and loss of appearance information. These
issues are worsened by these methods' excessive focus on background generation,
which is unnecessary in this task; it not only slows down inference but also
compromises foreground generation quality. Moreover, these methods introduce
unwanted artifacts in the transition area. In this paper, we formulate image
composition as a subject-based local editing task, solely focusing on
foreground generation. At each step, the edited foreground is combined with the
noisy background to maintain scene consistency. To address the remaining
issues, we propose PrimeComposer, a faster training-free diffuser that
composites the images by well-designed attention steering across different
noise levels. This steering is predominantly achieved by our Correlation
Diffuser, utilizing its self-attention layers at each step. Within these
layers, the synthesized subject interacts with both the referenced object and
background, capturing intricate details and coherent relationships. This prior
information is encoded into the attention weights, which are then integrated
into the self-attention layers of the generator to guide the synthesis process.
In addition, we introduce a Region-constrained Cross-Attention to confine the
impact of specific subject-related words to the desired regions, addressing the
unwanted artifacts observed in prior methods and thereby further improving
coherence in the transition area. Our method exhibits the fastest inference
efficiency, and extensive experiments demonstrate its superiority both
qualitatively and quantitatively.
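To make the mechanism above concrete, here is a minimal, hypothetical PyTorch sketch of the two steps the abstract describes: blending the edited foreground with a re-noised background under a subject mask at each denoising step, and steering the generator's self-attention toward prior weights captured from the reference object and background. All function and argument names (blend_step, steered_self_attention, prior_weights, steer_strength) are illustrative assumptions, not PrimeComposer's actual API.

```python
import torch

def blend_step(fg_latent, bg_latent, subject_mask, t, add_noise):
    """One composition step: keep the edited foreground inside the subject
    mask and the freshly re-noised background outside it, so the scene
    stays consistent while only the foreground is actually generated."""
    noisy_bg = add_noise(bg_latent, t)  # bring the clean background to noise level t
    return subject_mask * fg_latent + (1.0 - subject_mask) * noisy_bg

def steered_self_attention(q, k, v, prior_weights, steer_strength=0.5):
    """Self-attention whose weight map is interpolated toward prior weights
    (e.g., from a correlation pass over the reference object), so the
    synthesized subject retains the reference's appearance and context."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    attn = (1.0 - steer_strength) * attn + steer_strength * prior_weights
    return attn @ v

# Toy usage with random tensors (latents: batch 1, 4 channels, 64x64).
fg = torch.randn(1, 4, 64, 64)
bg = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0                               # subject region
add_noise = lambda x, t: x + 0.1 * t * torch.randn_like(x)  # stand-in scheduler
latent = blend_step(fg, bg, mask, t=10, add_noise=add_noise)

q = k = v = torch.randn(1, 64, 32)                          # (batch, tokens, dim)
prior = torch.softmax(torch.randn(1, 64, 64), dim=-1)
out = steered_self_attention(q, k, v, prior)
```

Interpolating attention maps rather than latents is one simple way to read "encoded into the attention weights, which are then integrated into the self-attention layers of the generator"; the paper's exact steering rule may differ.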
Related papers
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947] (2024-07-18)
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
- FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903] (2024-07-06)
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image.
We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z) - Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation [22.949365270116335]
We propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process at inference time.
Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation.
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154] (2024-03-25)
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process (see the masked cross-attention sketch after this list).
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
- Enhancing Object Coherence in Layout-to-Image Synthesis [13.785484396436367] (2023-11-17)
We propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules.
For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationships among the objects in an image.
To improve physical coherence, we develop a Self-similarity Coherence Attention (SCA) module to explicitly integrate local contextual physical coherence into each pixel's generation process.
- Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032] (2023-05-30)
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
- Take a Prior from Other Tasks for Severe Blur Removal [52.380201909782684] (2023-02-14)
We propose a cross-level feature learning strategy based on knowledge distillation to learn the priors, and a semantic prior embedding layer with multi-level aggregation and semantic attention transformation to integrate them effectively.
Experiments on natural image deblurring benchmarks and real-world images, such as the GoPro and RealBlur datasets, demonstrate our method's effectiveness and generalization ability.
- Deep Image Compositing [93.75358242750752] (2020-11-04)
We propose a new method which can automatically generate high-quality image composites without any user input.
Inspired by Laplacian pyramid blending, a densely connected multi-stream fusion network is proposed to effectively fuse information from the foreground and background images.
Experiments show that the proposed method can automatically generate high-quality composites and outperforms existing methods both qualitatively and quantitatively.
- Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284] (2020-08-28)
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts.
The context is specified by a bounding-box object layout that lacks shape information, while the pose of the person(s) is given by sparsely annotated keypoints.
To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared compositional structural space.
This structural space is then decoded to the image space using a multi-level feature modulation strategy, and is learned in a self-supervised manner.
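A primitive shared by several entries above (PrimeComposer's Region-constrained Cross-Attention, Bounded Attention, Cones 2's cross-attention rectification) is restricting which image positions each subject token may influence. The sketch below is one minimal reading of that idea; the masking scheme and all names are chosen for illustration rather than taken from any single paper.

```python
import torch

def region_constrained_cross_attention(q, k, v, region_mask):
    """Cross-attention in which each text token can only affect the image
    positions permitted by its region mask, keeping a subject's influence
    inside its layout region and out of the transition area.

    q:           (B, N_img, D)       image-side queries
    k, v:        (B, N_txt, D)       text-side keys / values
    region_mask: (B, N_img, N_txt)   bool, True = this token may act here
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5        # (B, N_img, N_txt)
    logits = logits.masked_fill(~region_mask, -1e9)  # block out-of-region pairs
    return torch.softmax(logits, dim=-1) @ v

# Toy usage: 2 tokens; confine token 1 to the last 8 of 16 image positions.
B, N_img, N_txt, D = 1, 16, 2, 8
q = torch.randn(B, N_img, D)
k = torch.randn(B, N_txt, D)
v = torch.randn(B, N_txt, D)
mask = torch.ones(B, N_img, N_txt, dtype=torch.bool)
mask[:, :8, 1] = False
out = region_constrained_cross_attention(q, k, v, mask)  # (1, 16, 8)
```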