Related papers: Token Pruning for In-Context Generation in Diffusion Transformers

Token Pruning for In-Context Generation in Diffusion Transformers

URL: http://arxiv.org/abs/2602.01609v1
Date: Mon, 02 Feb 2026 03:54:32 GMT
Title: Token Pruning for In-Context Generation in Diffusion Transformers
Authors: Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun,
Abstract summary: In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples.<n>Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm.<n>We introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs.
Score: 20.121758465381053
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30\% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.

Related papers

Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers [10.251154683874033]
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs.<n>We propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures.
arXiv Detail & Related papers (2025-11-20T08:53:07Z)
Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization [6.799413002613627]
Todyformer is a novel Transformer-based neural network tailored for dynamic graphs. It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers. We show that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks.
arXiv Detail & Related papers (2024-02-02T23:05:30Z)
Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations. We propose the textbfCorner-to-Center transformer-based Context Model (C$3$M) designed to enhance context and latent predictions. In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
Entropy Transformer Networks: A Learning Approach via Tangent Bundle Data Manifold [8.893886200299228]
This paper focuses on an accurate and fast approach for image transformation employed in the design of CNN architectures. A novel Entropy STN (ESTN) is proposed that interpolates on the data manifold distributions. Experiments on challenging benchmarks show that the proposed ESTN can improve predictive accuracy over a range of computer vision tasks.
arXiv Detail & Related papers (2023-07-24T04:21:51Z)
Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics. By comparing the transformer features between recovered image and target one, the pretrained transformer provides high-resolution blur-sensitive semantic information. One regards the features as vectors and computes the discrepancy between representations extracted from recovered image and target one in Euclidean space.
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency. The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on. Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task. We employ a restrictive CNN with small and non-overlapping RF for token representation. In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
Learning Context-Based Non-local Entropy Modeling for Image Compression [140.64888994506313]
In this paper, we propose a non-local operation for context modeling by employing the global similarity within the context. The entropy model is further adopted as the rate loss in a joint rate-distortion optimization. Considering that the width of the transforms is essential in training low distortion models, we finally produce a U-Net block in the transforms to increase the width with manageable memory consumption and time complexity.
arXiv Detail & Related papers (2020-05-10T13:28:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.