DiP: Taming Diffusion Models in Pixel Space
- URL: http://arxiv.org/abs/2511.18822v2
- Date: Thu, 27 Nov 2025 09:29:50 GMT
- Title: DiP: Taming Diffusion Models in Pixel Space
- Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
- Abstract summary: A Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details.
- Score: 91.51011771517683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10$\times$ faster inference than previous methods while increasing the total number of parameters by only 0.3%, and reaches a 1.79 FID score on ImageNet 256$\times$256.
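The decoupling the abstract describes, a backbone that reasons over a small number of large-patch tokens plus a lightweight head that refines each patch back to pixels, can be illustrated with a toy NumPy sketch. All names and the identity stand-ins for the two networks are hypothetical, not the authors' code; the point is only to show the mechanics of patch tokenization and why large patches shrink the sequence length the backbone must attend over (256 tokens instead of 65,536 pixel positions at 256$\times$256 with patch size 16).

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (H//p * W//p, p*p*C) patch tokens."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    # Reorder to (h_blocks, w_blocks, p, p, C), then flatten each patch.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: reassemble tokens into an (H, W, C) image."""
    x = tokens.reshape(H // p, W // p, p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def backbone(tokens):
    # Stand-in for the DiT backbone: operates on the short token sequence
    # to build global structure (identity placeholder here).
    return tokens

def detailer(tokens, context):
    # Stand-in for the lightweight Patch Detailer Head: refines each patch
    # using contextual features from the backbone (no-op placeholder).
    return tokens + 0.0 * context

H = W = 256
C, p = 3, 16
img = np.random.rand(H, W, C)

tokens = patchify(img, p)                 # 256 tokens, each 16*16*3 = 768-dim
coarse = backbone(tokens)                 # global stage on large patches
refined = detailer(coarse, coarse)        # local stage restores fine detail
out = unpatchify(refined, H, W, C, p)
assert out.shape == (H, W, C)
print(tokens.shape)                       # sequence length the backbone sees
```

Because attention cost grows quadratically with sequence length, running the backbone on 256 patch tokens rather than per-pixel positions is where the claimed efficiency comes from; the detailer then only pays a small, local cost per patch.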
Related papers
- NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration [5.158202521463481]
NanoSD is a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices. We show how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation.
arXiv Detail & Related papers (2026-01-14T19:30:53Z) - E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources [12.244453688491731]
Efficient Multimodal Diffusion Transformer (E-MMDiT) is an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis. Our model for 512px generation, trained with only 25M public data samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with post-training techniques such as GRPO.
arXiv Detail & Related papers (2025-10-31T03:13:08Z) - ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion [7.233066974580282]
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. We propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training.
arXiv Detail & Related papers (2025-10-29T17:17:32Z) - Exploring Diffusion with Test-Time Training on Efficient Image Restoration [1.3830502387127932]
DiffRWKVIR is a novel framework unifying Test-Time Training (TTT) with efficient diffusion. Our method establishes a new paradigm for adaptive, high-efficiency image restoration with optimized hardware utilization.
arXiv Detail & Related papers (2025-06-17T14:01:59Z) - Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization [66.17291150498276]
We introduce Measurements Optimization (MO), a more efficient plug-and-play module for integrating measurement information at each step of the inverse problem-solving process. Using MO, we establish state-of-the-art (SOTA) performance across multiple tasks.
arXiv Detail & Related papers (2024-12-05T07:44:18Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge [63.00793292863]
ToddlerDiffusion is a novel approach to decomposing the complex task of RGB image generation into simpler, interpretable stages.
Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation.
ToddlerDiffusion consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-24T15:20:01Z) - Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation [1.0499611180329804]
We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages which enables a simple and efficient feature fusion within the module to solve the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves state-of-the-art accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-02-23T05:34:51Z) - SDM: Spatial Diffusion Model for Large Hole Image Inpainting [106.90795513361498]
We present a novel spatial diffusion model (SDM) that uses a few iterations to gradually deliver informative pixels to the entire image.
Also, thanks to the proposed decoupled probabilistic modeling and spatial diffusion scheme, our method achieves high-quality large-hole completion.
arXiv Detail & Related papers (2022-12-06T13:30:18Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
DETR recently pioneered solving vision tasks with transformers by directly translating the image feature map into the object detection result.
Recent transformer-based image recognition models such as ViT show consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to Transformer to address this challenge.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128\times 128$ and FFHQ $256\times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.