Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers
- URL: http://arxiv.org/abs/2504.10148v2
- Date: Mon, 21 Apr 2025 03:29:53 GMT
- Title: Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers
- Authors: Chunyang Zhang, Zhenhong Sun, Zhicheng Zhang, Junyan Wang, Yu Zhang, Dong Gong, Huadong Mo, Daoyi Dong
- Abstract summary: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS). Traditional MIS control methods for UNet architectures fail to adapt to DiT-based models. We propose a training-free approach for enhancing MIS in DiT-based models.
- Score: 22.269573676129152
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.
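The abstract's core idea, amplifying key regions and suppressing irrelevant ones in attention maps according to a layer-dependent token hierarchy, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `active_group` and `tune_attention`, the equal-thirds layer split, and the additive logit bias are all assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def active_group(layer, num_layers):
    """Pick which token group to tune at a given layer.

    Assumption: an equal-thirds split that mirrors the reported hierarchy
    (instance tokens early, background tokens in the middle, attribute
    tokens late). The real layer boundaries are not given in the abstract.
    """
    third = num_layers / 3
    if layer < third:
        return "instance"
    if layer < 2 * third:
        return "background"
    return "attribute"


def tune_attention(logits, region_mask, token_ids, strength):
    """Bias image-to-text attention toward a sketch region (hypothetical).

    logits:      (num_image_tokens, num_text_tokens) pre-softmax scores
    region_mask: (num_image_tokens,) bool, True inside the instance sketch
    token_ids:   indices of the text tokens tied to that region
    strength:    non-negative bias added inside / subtracted outside
    """
    biased = logits.astype(float)  # astype returns a copy
    cols = np.asarray(token_ids)
    biased[np.ix_(region_mask, cols)] += strength   # amplify key region
    biased[np.ix_(~region_mask, cols)] -= strength  # suppress elsewhere
    return softmax(biased, axis=-1)


# Usage: 4 image tokens, 3 text tokens, text token 1 belongs to the
# region covering the first two image tokens.
logits = np.zeros((4, 3))
mask = np.array([True, True, False, False])
attn = tune_attention(logits, mask, [1], strength=2.0)
```

In a step-layer-wise scheme, a caller would presumably apply `tune_attention` only to the token group returned by `active_group` for the current layer, and possibly only during selected denoising steps; that gating is left out here because the abstract does not specify it.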
Related papers
- LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations [18.728541981438216]
Existing text-to-image (T2I) models show decayed performance in compositional image generation involving multiple objects and intricate relationships.
We construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs.
We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation.
arXiv Detail & Related papers (2024-12-11T17:57:10Z)
- Adaptive Large Language Models By Layerwise Attention Shortcuts [46.76681147411957]
LLM-like setups allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism.
We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture.
arXiv Detail & Related papers (2024-09-17T03:46:01Z)
- HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution [6.546896650921257]
We propose HiTSR, a hierarchical transformer model for reference-based image super-resolution.
We streamline the architecture and training pipeline by incorporating the double attention block from GAN literature.
Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109.
arXiv Detail & Related papers (2024-08-30T01:16:29Z)
- Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition [49.536368818512116]
Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain.
We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two.
We propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism.
arXiv Detail & Related papers (2024-01-18T19:02:00Z)
- Skeleton-Guided Instance Separation for Fine-Grained Segmentation in Microscopy [23.848474219551818]
One of the fundamental challenges in microscopy (MS) image analysis is instance segmentation (IS).
We propose a novel one-stage framework named A2B-IS to address this challenge and enhance the accuracy of IS in MS images.
Our method has been thoroughly validated on two large-scale MS datasets.
arXiv Detail & Related papers (2024-01-18T11:14:32Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- DARTS: Double Attention Reference-based Transformer for Super-resolution [12.424350934766704]
We present DARTS, a transformer model for reference-based image super-resolution.
DARTS learns joint representations of two image distributions to enhance the content of low-resolution input images.
We show that our transformer-based model performs competitively with state-of-the-art models.
arXiv Detail & Related papers (2023-07-17T20:57:16Z)
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring dependencies of object-to-object, object-to-patch and patch-to-patch.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.