Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
- URL: http://arxiv.org/abs/2510.11538v2
- Date: Tue, 14 Oct 2025 04:40:48 GMT
- Title: Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
- Authors: Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin
- Abstract summary: Diffusion Transformers (DiTs) have emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps. We propose Detail Guidance (DG) to explicitly enhance local detail fidelity for DiTs.
- Score: 33.765941209545986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that these massive activations occur across all spatial tokens, and that their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis while having minimal impact on the overall semantic content of the output. Building on these insights, we propose Detail Guidance (DG), an MA-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. DG integrates seamlessly with Classifier-Free Guidance (CFG), enabling further refinement of fine-grained details. Extensive experiments demonstrate that DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
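The guidance step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the formula below is an assumption: it follows the common self-guidance pattern of extrapolating away from a weaker model's prediction, with an optional CFG term added. The function name `detail_guidance`, the weights `w_dg` and `w_cfg`, and the use of flat lists of floats in place of noise-prediction tensors are all hypothetical.

```python
from typing import List, Optional, Sequence


def detail_guidance(
    eps_cond: Sequence[float],
    eps_degraded: Sequence[float],
    eps_uncond: Optional[Sequence[float]] = None,
    w_dg: float = 1.0,
    w_cfg: float = 1.0,
) -> List[float]:
    """Combine model predictions in a self-guidance style (assumed form).

    eps_cond:     prediction of the original network.
    eps_degraded: prediction of the "detail-deficient" network, i.e. the
                  same network with its massive activations disrupted.
    eps_uncond:   optional unconditional prediction for a CFG term.
    """
    if len(eps_cond) != len(eps_degraded):
        raise ValueError("predictions must have the same length")
    if eps_uncond is not None and len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")

    guided = []
    for i, cond in enumerate(eps_cond):
        if eps_uncond is None:
            base = cond
        else:
            # Standard CFG extrapolation from unconditional to conditional.
            base = eps_uncond[i] + w_cfg * (cond - eps_uncond[i])
        # Assumed detail term: push away from the degraded prediction.
        guided.append(base + w_dg * (cond - eps_degraded[i]))
    return guided
```

With `w_dg = 0` and no unconditional input the function returns the original prediction unchanged, which makes the detail term easy to ablate in this sketch.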
Related papers
- Vision Transformers Need More Than Registers [70.42157905484765]
Vision Transformers (ViTs) provide general-purpose representations for diverse downstream tasks.
Artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks.
We conclude that these artifacts originate from a lazy aggregation behavior.
arXiv Detail & Related papers (2026-02-25T20:42:35Z)
- HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models [82.10385962490051]
Generative diffusion models show promise for data augmentation.
Applying them to fine-grained tasks presents a significant challenge.
HiGFA is a hierarchical, confidence-driven orchestration that generates diverse yet faithful synthetic images.
arXiv Detail & Related papers (2025-11-16T10:46:16Z)
- RefAM: Attention Magnets for Zero-Shot Referral Segmentation [103.98022860792504]
We introduce a new method that exploits features, attention scores, from diffusion transformers for downstream tasks.
A key insight is that stop words act as attention magnets.
We propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters.
arXiv Detail & Related papers (2025-09-26T17:59:57Z)
- Weight Spectra Induced Efficient Model Adaptation [54.8615621415845]
Fine-tuning large-scale foundation models incurs prohibitive computational costs.
We show that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact.
We propose a novel method that leverages learnable rescaling of top singular directions.
arXiv Detail & Related papers (2025-05-29T05:03:29Z)
- Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations [41.28795436569343]
Diffusion Transformers (DiTs) exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others.
We propose Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs.
arXiv Detail & Related papers (2025-05-24T08:20:36Z)
- Automated Learning of Semantic Embedding Representations for Diffusion Models [1.688134675717698]
We employ a multi-level denoising autoencoder framework to expand the representation capacity of denoising diffusion models.
Our work justifies that DDMs are not only suitable for generative tasks, but also potentially advantageous for general-purpose deep learning applications.
arXiv Detail & Related papers (2025-05-09T02:10:46Z)
- FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models.
We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components.
FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z)
- Disentangling Masked Autoencoders for Unsupervised Domain Generalization [57.56744870106124]
Unsupervised domain generalization is fast gaining attention but is still far from well-studied.
Disentangled Masked Autoencoder (DisMAE) aims to discover the disentangled representations that faithfully reveal intrinsic features.
DisMAE co-trains the asymmetric dual-branch architecture with semantic and lightweight variation encoders.
arXiv Detail & Related papers (2024-07-10T11:11:36Z)
- Towards Better Data Exploitation in Self-Supervised Monocular Depth Estimation [14.262669370264994]
In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets.
Specifically, the original image and the generated two augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation.
Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth.
arXiv Detail & Related papers (2023-09-11T06:18:05Z)
- Distilling Representations from GAN Generator via Squeeze and Span [55.76208869775715]
We propose to distill knowledge from GAN generators by squeezing and spanning their representations.
We span the distilled representation of the synthetic domain to the real domain by also using real training data to remedy the mode collapse of GANs.
arXiv Detail & Related papers (2022-11-06T01:10:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.