PixNerd: Pixel Neural Field Diffusion
- URL: http://arxiv.org/abs/2507.23268v2
- Date: Mon, 04 Aug 2025 02:46:11 GMT
- Title: PixNerd: Pixel Neural Field Diffusion
- Authors: Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang
- Abstract summary: We propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution. Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE.
- Score: 30.872185815524286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
Related papers
- HNOSeg-XS: Extremely Small Hartley Neural Operator for Efficient and Resolution-Robust 3D Image Segmentation [3.990336239705776]
We propose a resolution-robust HNOSeg-XS architecture for medical image segmentation. It is resolution robust, fast, memory efficient, and extremely parameter efficient. It was tested on the BraTS'23, KiTS'23, and MVSeg'23 datasets with a Tesla V100 GPU.
arXiv Detail & Related papers (2025-07-10T22:33:19Z)
- FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution [33.07779971446476]
We propose FlowDCN, a purely convolution-based generative model that can efficiently generate high-quality images at arbitrary resolutions.
FlowDCN achieves the state-of-the-art 4.30 sFID on the $256\times256$ ImageNet benchmark and comparable resolution extrapolation results.
We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
arXiv Detail & Related papers (2024-10-30T02:48:50Z)
- PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher [55.22994720855957]
PaGoDA is a novel pipeline that reduces the training costs through three stages: training diffusion on downsampled data, distilling the pretrained diffusion, and progressive super-resolution.
With the proposed pipeline, PaGoDA achieves a $64\times$ reduced cost in training its diffusion model on 8x downsampled data.
PaGoDA's pipeline can be applied directly in the latent space, adding compression alongside the pre-trained autoencoder in Latent Diffusion Models.
arXiv Detail & Related papers (2024-05-23T17:39:09Z)
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer based methods have achieved great success in image inpainting recently.
They regard each pixel as a token, thus suffering from an information loss issue.
We propose a new transformer based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
- A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE [0.8403582577557918]
The Transformer has been adopted for image recognition tasks and shown to outperform CNNs and RNNs, though it suffers from high training cost and computational complexity.
We propose a lightweight hybrid model which uses Neural ODE as a backbone instead of ResNet.
The proposed model is deployed on a modest-sized FPGA device for edge computing.
arXiv Detail & Related papers (2024-01-05T09:32:39Z)
- CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we try to break the limitations for the first time thanks to the recent development of continuous implicit representation.
Experiments show that the proposed method achieves real-time performance on $2048\times2048$ images using a single GTX 2080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
- Reduce Information Loss in Transformers for Pluralistic Image Inpainting [112.50657646357494]
We propose a new transformer based framework "PUT" to keep input information as much as possible.
PUT greatly outperforms state-of-the-art methods on image fidelity, especially for large masked regions and complex large-scale datasets.
arXiv Detail & Related papers (2022-05-10T17:59:58Z)
- PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation [88.55256389703082]
Pixel synthesis is a promising research paradigm for image generation, which can well exploit pixel-wise prior knowledge for generation.
In this paper, we propose a progressive pixel synthesis network towards efficient image generation, called PixelFolder.
With much less expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z)
- Small Lesion Segmentation in Brain MRIs with Subpixel Embedding [105.1223735549524]
We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues.
We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network.
arXiv Detail & Related papers (2021-09-18T00:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.