PixelDiT: Pixel Diffusion Transformers for Image Generation
- URL: http://arxiv.org/abs/2511.20645v1
- Date: Tue, 25 Nov 2025 18:59:25 GMT
- Title: PixelDiT: Pixel Diffusion Transformers for Image Generation
- Authors: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
- Abstract summary: PixelDiT is a single-stage, end-to-end pixel-space Diffusion Transformer. It eliminates the need for an autoencoder and learns the diffusion process directly in pixel space. It achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin.
- Score: 48.456815413366535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
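To make the dual-level design concrete, here is a minimal PyTorch sketch: a patch-level transformer models coarse patch tokens for global semantics, and a pixel-level transformer refines the pixels inside each patch, conditioned on that patch's token. Module names, sizes, and the additive conditioning are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a dual-level (patch-level + pixel-level) transformer for
# pixel-space diffusion. Sizes, names, and the additive conditioning are
# assumptions for illustration; see the paper for the real architecture.
import torch
import torch.nn as nn

class DualLevelDiTSketch(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=384, pix_dim=96,
                 depth=4, pix_depth=2):
        super().__init__()
        self.patch = patch
        self.n_patches = (img_size // patch) ** 2
        # Patch-level branch: global semantics over coarse patch tokens.
        self.patch_embed = nn.Linear(3 * patch * patch, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        self.patch_blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), depth)
        # Pixel-level branch: refines the pixels within each patch,
        # conditioned on that patch's semantic token.
        self.pix_embed = nn.Linear(3, pix_dim)
        self.cond_proj = nn.Linear(dim, pix_dim)
        self.pix_blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(pix_dim, nhead=4, batch_first=True),
            pix_depth)
        self.out = nn.Linear(pix_dim, 3)

    def forward(self, x):  # x: (B, 3, H, W), e.g. a noisy image
        B, C, H, W = x.shape
        p, N = self.patch, self.n_patches
        # Split into non-overlapping patches, layout (B, h, w, p, p, C).
        patches = x.unfold(2, p, p).unfold(3, p, p).permute(0, 2, 3, 4, 5, 1)
        sem = self.patch_blocks(
            self.patch_embed(patches.reshape(B, N, p * p * C)) + self.pos)
        # Pixel tokens per patch, conditioned additively on the patch token.
        pix = self.pix_embed(patches.reshape(B * N, p * p, C))
        pix = pix + self.cond_proj(sem).reshape(B * N, 1, -1)
        out = self.out(self.pix_blocks(pix))          # (B*N, p*p, 3)
        # Fold the per-patch predictions back into an image.
        out = out.reshape(B, H // p, W // p, p, p, C)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

# Shape check: predicts, e.g., the noise for a 256x256 RGB input.
# eps_hat = DualLevelDiTSketch()(torch.randn(2, 3, 256, 256))
```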
Related papers
- PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss [47.868429337792314]
We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses: an LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics (a toy combination of the two is sketched below).
arXiv Detail & Related papers (2026-02-02T18:59:42Z)
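A rough sketch of the loss pairing described above, assuming the `lpips` pip package and the official DINO ViT-S/16 from torch.hub; the MSE feature distance, the loss weights, and the input handling are assumptions, not PixelGen's actual formulation.

```python
# Sketch of combining an LPIPS loss with a DINO-based perceptual loss, in the
# spirit of PixelGen. The feature distance (MSE) and weights are assumptions.
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg').eval()
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').eval()
for p in dino.parameters():          # supervision networks stay frozen
    p.requires_grad_(False)

def perceptual_loss(pred, target, w_lpips=1.0, w_dino=1.0):
    """pred, target: (B, 3, H, W) images scaled to [-1, 1].
    (Proper DINO input normalization is glossed over here.)"""
    # LPIPS encourages better local patterns.
    l_local = lpips_fn(pred, target).mean()
    # Distance between DINO global embeddings strengthens semantics.
    with torch.no_grad():
        f_target = dino(target)
    l_global = F.mse_loss(dino(pred), f_target)
    return w_lpips * l_local + w_dino * l_global
```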
- Pixel-Perfect Visual Geometry Estimation [40.241009117140514]
We present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds. Our models achieve the best performance among all generative monocular and video depth estimation models.
arXiv Detail & Related papers (2026-01-08T18:59:49Z)
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation [93.6273078684831]
We propose a frequency-DeCoupled pixel diffusion framework to pursue a more efficient pixel diffusion paradigm. To decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder that generates high-frequency details conditioned on semantic guidance (a toy frequency split is sketched below). Experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet.
arXiv Detail & Related papers (2025-11-24T17:59:06Z)
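One way to realize the frequency decoupling sketched in the summary: take the low band from a downsample/upsample branch, the high band as the exact residual, and let a lightweight convolutional decoder predict the high band from semantic guidance. All names, sizes, and the conditioning scheme below are assumptions, not DeCo's actual design.

```python
# Toy frequency decoupling: low band via downsample/upsample, high band as
# the exact residual, plus a lightweight decoder for high-frequency details
# conditioned on (assumed) pooled semantic features from a DiT branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def split_frequencies(x, factor=8):
    """x: (B, C, H, W). Returns (low, high) with low + high == x exactly."""
    low = F.interpolate(F.avg_pool2d(x, factor), scale_factor=factor,
                        mode='bilinear', align_corners=False)
    return low, x - low

class HighFreqDecoderSketch(nn.Module):
    """Small conv decoder: predicts the high-frequency band of the target."""
    def __init__(self, cond_dim=384, width=64):
        super().__init__()
        self.inp = nn.Conv2d(3, width, 3, padding=1)
        self.cond = nn.Linear(cond_dim, width)
        self.body = nn.Sequential(
            nn.SiLU(), nn.Conv2d(width, width, 3, padding=1),
            nn.SiLU(), nn.Conv2d(width, 3, 3, padding=1))

    def forward(self, x_noisy, sem):  # sem: (B, cond_dim) semantic guidance
        h = self.inp(x_noisy) + self.cond(sem)[:, :, None, None]
        return self.body(h)           # predicted high-frequency component

# low, high = split_frequencies(torch.randn(2, 3, 256, 256))
```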
- DiP: Taming Diffusion Models in Pixel Space [91.51011771517683]
A Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details.
arXiv Detail & Related papers (2025-11-24T06:55:49Z)
- Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers [45.701222598522456]
Pixel-Perfect Depth is a monocular depth estimation model based on pixel-space diffusion generation. Our model achieves the best performance among all published generative models across five benchmarks.
arXiv Detail & Related papers (2025-10-08T17:59:33Z)
- Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion [34.70370851239368]
We show that pixel-space models can be very competitive with latent models in both quality and efficiency. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions.
arXiv Detail & Related papers (2024-10-25T06:20:06Z)
- Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers [2.078423403798577]
We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution.
Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers (a toy hourglass block is sketched below).
arXiv Detail & Related papers (2024-01-21T21:49:49Z)
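The hourglass idea can be caricatured as merge-process-split with a skip connection: tokens are merged 2x2 on the way down, run through comparatively cheap global attention at low resolution, then split back up. HDiT's real model stacks several such levels and uses neighborhood attention at the high-resolution ones; this sketch and its sizes are assumptions.

```python
# Caricature of one hourglass level: merge 2x2 token neighborhoods, run
# global attention on the shorter sequence, split back, add a skip.
import torch
import torch.nn as nn

class HourglassSketch(nn.Module):
    def __init__(self, dim=128, depth=2):
        super().__init__()
        self.merge = nn.Linear(4 * dim, dim)   # 2x2 neighborhood -> 1 token
        self.split = nn.Linear(dim, 4 * dim)   # 1 token -> 2x2 neighborhood
        self.mid = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), depth)

    def forward(self, tokens, hw):
        """tokens: (B, H*W, dim) on an H x W grid; hw = (H, W)."""
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.reshape(B, H // 2, 2, W // 2, 2, D)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N // 4, 4 * D)
        mid = self.mid(self.merge(x))          # attention on N/4 tokens
        up = self.split(mid).reshape(B, H // 2, W // 2, 2, 2, D)
        up = up.permute(0, 1, 3, 2, 4, 5).reshape(B, N, D)
        return tokens + up                     # skip connection

# y = HourglassSketch()(torch.randn(2, 64 * 64, 128), hw=(64, 64))
```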
- PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation [88.55256389703082]
Pixel synthesis is a promising research paradigm for image generation, which can well exploit pixel-wise prior knowledge for generation.
In this paper, we propose a progressive pixel synthesis network towards efficient image generation, coined PixelFolder.
With much less expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z)
- PixelPyramids: Exact Inference Models from Lossless Image Pyramids [58.949070311990916]
PixelPyramids is a block-autoregressive approach with scale-specific representations to encode the joint distribution of image pixels.
It yields state-of-the-art results for density estimation on various image datasets, especially for high-resolution data.
For CelebA-HQ 1024 x 1024, we observe that the density estimates are improved to 44% of the baseline, while sampling speeds remain superior even to easily parallelizable flow-based models (a toy lossless pyramid is sketched below).
arXiv Detail & Related papers (2021-10-17T10:47:29Z)
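The claim above hinges on the pyramid being lossless, i.e., exactly invertible. A polyphase (2x2 subsampling) pyramid has that property and makes a simple stand-in; PixelPyramids' actual decomposition and its autoregressive factor models differ.

```python
# Sketch of a lossless pyramid via polyphase (2x2) subsampling: the top-left
# phase is the coarse level, the other three phases are the per-scale
# details, and the original image is exactly recoverable.
import numpy as np

def build_pyramid(img, levels):
    """img: (H, W, C) array with H, W divisible by 2**levels."""
    details, coarse = [], img
    for _ in range(levels):
        coarse, d1, d2, d3 = (coarse[0::2, 0::2], coarse[0::2, 1::2],
                              coarse[1::2, 0::2], coarse[1::2, 1::2])
        details.append((d1, d2, d3))       # scale-specific representation
    return coarse, details

def reconstruct(coarse, details):
    for d1, d2, d3 in reversed(details):
        h, w = coarse.shape[:2]
        full = np.empty((2 * h, 2 * w) + coarse.shape[2:], coarse.dtype)
        full[0::2, 0::2], full[0::2, 1::2] = coarse, d1
        full[1::2, 0::2], full[1::2, 1::2] = d2, d3
        coarse = full
    return coarse

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
coarse, details = build_pyramid(img, levels=3)
assert np.array_equal(reconstruct(coarse, details), img)  # exactly invertible
```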
- Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image (a minimal implementation is sketched below).
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
arXiv Detail & Related papers (2020-06-22T17:59:07Z)
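A minimal version of the mechanism: masking the weights per output location is equivalent to masking the unfolded input patches, since (W * M_i) . p_i = W . (M_i * p_i), so one unfold plus a single matmul suffices. Bias handling and the paper's efficiency and ordering machinery are omitted; the raster-scan mask in the usage lines is just an example.

```python
# Sketch of a locally masked convolution (LMConv): one binary mask over the
# receptive field per output location, applied to the unfolded patches.
import torch
import torch.nn.functional as F

def locally_masked_conv2d(x, weight, masks):
    """x: (B, Cin, H, W); weight: (Cout, Cin, k, k);
    masks: (H*W, k*k), one receptive-field mask per output location."""
    B, Cin, H, W = x.shape
    Cout, _, k, _ = weight.shape
    L = H * W
    patches = F.unfold(x, k, padding=k // 2)             # (B, Cin*k*k, L)
    # Masking the patches here equals masking the weights at each location.
    m = masks.t().reshape(1, 1, k * k, L)
    patches = (patches.reshape(B, Cin, k * k, L) * m).reshape(B, -1, L)
    out = torch.einsum('oi,bil->bol', weight.reshape(Cout, -1), patches)
    return out.reshape(B, Cout, H, W)

# Usage: a raster-scan-style mask (center pixel and everything after it
# zeroed), identical at every location here just for a shape check.
k, H, W = 3, 32, 32
m = torch.ones(k * k)
m[k * k // 2:] = 0
y = locally_masked_conv2d(torch.randn(2, 3, H, W),
                          torch.randn(8, 3, k, k),
                          m.expand(H * W, k * k))        # (2, 8, 32, 32)
```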