Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
- URL: http://arxiv.org/abs/2510.07316v2
- Date: Wed, 29 Oct 2025 02:15:20 GMT
- Title: Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
- Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang
- Abstract summary: Pixel-Perfect Depth is a monocular depth estimation model based on pixel-space diffusion generation. Our model achieves the best performance among all published generative models across five benchmarks.
- Score: 45.701222598522456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces "flying pixels" at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
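The abstract only sketches the two designs, so the PyTorch-style snippet below is an illustrative reading of them, not the authors' implementation. The class name `SemanticsPromptedDiTBlock`, the concatenation-based prompt injection, the choice of a DINOv2-style encoder as the foundation model, and all dimensions and token counts are assumptions made for this sketch.

```python
# Illustrative sketch of Semantics-Prompted DiT (SP-DiT); the real
# architecture is not specified in the abstract. Module names, the
# concatenation-based prompt injection, and all sizes are assumptions.
import torch
import torch.nn as nn

class SemanticsPromptedDiTBlock(nn.Module):
    """A DiT-style transformer block whose self-attention also sees
    semantic tokens projected from a vision foundation model
    (e.g., a DINOv2-style encoder)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, pixel_tokens: torch.Tensor,
                semantic_tokens: torch.Tensor) -> torch.Tensor:
        # Prepend semantic prompt tokens so attention can couple global
        # semantics with the noisy pixel tokens being denoised.
        n_prompt = semantic_tokens.shape[1]
        h = torch.cat([semantic_tokens, pixel_tokens], dim=1)
        a = self.norm1(h)
        h = h + self.attn(a, a, a, need_weights=False)[0]
        h = h + self.mlp(self.norm2(h))
        return h[:, n_prompt:]  # keep only the pixel tokens

# Cascade DiT idea: early stages denoise few (coarse) tokens, later
# stages progressively more (finer) tokens. Counts here are made up.
if __name__ == "__main__":
    block = SemanticsPromptedDiTBlock(dim=384)
    prompts = torch.randn(2, 64, 384)       # projected semantic features
    for n_tokens in (256, 1024, 4096):      # hypothetical cascade stages
        x = torch.randn(2, n_tokens, 384)   # noisy pixel tokens
        print(block(x, prompts).shape)      # -> (2, n_tokens, 384)
```

Concatenating prompt tokens is only one plausible injection scheme; cross-attention or adaptive-normalization conditioning would serve the same stated goal of preserving global semantic consistency during pixel-space denoising.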
Related papers
- Pixel-Perfect Visual Geometry Estimation [40.241009117140514]
We present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds. Our models achieve the best performance among all generative monocular and video depth estimation models.
arXiv Detail & Related papers (2026-01-08T18:59:49Z)
- PixelDiT: Pixel Diffusion Transformers for Image Generation [48.456815413366535]
PixelDiT is a single-stage, end-to-end Diffusion Transformer for pixel-space image generation. It eliminates the need for an autoencoder and learns the diffusion process directly in pixel space. It achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin.
arXiv Detail & Related papers (2025-11-25T18:59:25Z)
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation [93.6273078684831]
We propose a frequency-DeCoupled pixel diffusion framework to pursue a more efficient pixel diffusion paradigm. With the intuition to decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance. Experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet.
arXiv Detail & Related papers (2025-11-24T17:59:06Z)
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Our single-step deterministic inference yields substantially faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z)
- DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion [63.89701893364156]
We propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness.
arXiv Detail & Related papers (2025-09-28T08:05:43Z)
- High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy [23.431898388115044]
High-precision dichotomous image segmentation (DIS) is the task of extracting fine-grained objects from high-resolution images. Existing methods face a dilemma: non-diffusion methods work efficiently but suffer from false or missed detections due to weak semantics. We find that pseudo-depth information from monocular depth estimation models can provide essential semantic understanding.
arXiv Detail & Related papers (2025-03-08T07:02:28Z)
- High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity [69.32473738284374]
Diffusion models have revolutionized text-to-image synthesis by delivering exceptional quality, fine detail resolution, and strong contextual awareness. We propose DiffDIS, a diffusion-driven segmentation model that taps into the potential of the pre-trained U-Net within diffusion models. Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state-of-the-art results through a streamlined inference process.
arXiv Detail & Related papers (2024-10-14T02:49:23Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion [3.5940515868907164]
We propose a new model for depth completion based on an encoder-decoder structure. Our model introduces two key components: the Mask-adaptive Gated Convolution architecture and the Bi-directional Progressive Fusion module. We achieve remarkable performance in completing depth maps and outperform existing approaches in terms of accuracy and reliability.
arXiv Detail & Related papers (2024-01-15T02:58:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.