Pixel-Perfect Visual Geometry Estimation
- URL: http://arxiv.org/abs/2601.05246v1
- Date: Thu, 08 Jan 2026 18:59:49 GMT
- Title: Pixel-Perfect Visual Geometry Estimation
- Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
- Abstract summary: We present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds. Our models achieve the best performance among all generative monocular and video depth estimation models.
- Score: 40.241009117140514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
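The abstract describes two architectural ideas for making pixel-space diffusion tractable: prompting the DiT with semantic tokens from a vision foundation model, and a cascade that runs early denoising stages on fewer image tokens. The PyTorch sketch below is a minimal illustration of those two ideas only; the module names, shapes, and token schedule (SemanticsPromptedBlock, cascade_token_schedule) are assumptions for illustration and not the authors' implementation.

```python
# Minimal sketch, not the authors' code: (1) a DiT block whose attention is "prompted"
# by semantic tokens from a frozen vision foundation model, and (2) a cascade schedule
# that runs early denoising stages on fewer image tokens. All names, shapes, and the
# schedule itself are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticsPromptedBlock(nn.Module):
    """Image tokens attend over [semantic prompt tokens ; image tokens]."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, D) noisy depth tokens denoised directly in pixel space
        # sem: (B, M, D) semantic tokens projected from a vision foundation model
        q = self.norm1(x)
        kv = torch.cat([sem, q], dim=1)              # semantics prompt the keys/values
        attn_out, _ = self.attn(q, kv, kv, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

def cascade_token_schedule(full_tokens: int, stages: int = 3) -> list[int]:
    """Coarse-to-fine token counts, e.g. 256 -> 1024 -> 4096 (hypothetical schedule)."""
    return [full_tokens // 4 ** (stages - 1 - s) for s in range(stages)]

if __name__ == "__main__":
    dim, block = 256, SemanticsPromptedBlock(256)
    sem = torch.randn(2, 64, dim)                    # prompt tokens from a frozen encoder
    for n in cascade_token_schedule(4096):           # later stages use more image tokens
        x = torch.randn(2, n, dim)                   # stage input (coarse-to-fine hand-off omitted)
        print(n, block(x, sem).shape)
```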
Related papers
- Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement [6.91111219679588]
Blur2Sharp is a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images. Our method employs a dual-conditioning architecture: first, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy.
arXiv Detail & Related papers (2025-12-09T03:49:12Z)
- PixelDiT: Pixel Diffusion Transformers for Image Generation [48.456815413366535]
PixelDiT is a single-stage, end-to-end pixel-space diffusion transformer for image generation. It eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. It achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel-space generative models by a large margin.
arXiv Detail & Related papers (2025-11-25T18:59:25Z)
- Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers [45.701222598522456]
Pixel-Perfect Depth is a monocular depth estimation model based on pixel-space diffusion generation. Our model achieves the best performance among all published generative models across five benchmarks.
arXiv Detail & Related papers (2025-10-08T17:59:33Z)
- UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting [64.31900521467362]
No existing pre-training method is equally effective for both object- and scene-level point clouds. We introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture.
arXiv Detail & Related papers (2025-06-11T17:23:21Z)
- HORT: Monocular Hand-held Objects Reconstruction with Transformers [61.36376511119355]
Reconstructing hand-held objects in 3D from monocular images is a significant challenge in computer vision. We propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
arXiv Detail & Related papers (2025-03-27T09:45:09Z)
- DreamPolish: Domain Score Distillation With Progressive Geometry Generation [66.94803919328815]
We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures.
In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process.
In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward a domain of photorealistic, high-quality textures.
arXiv Detail & Related papers (2024-11-03T15:15:01Z)
- PixelBytes: Catching Unified Representation for Multimodal Generation [0.0]
PixelBytes is an approach for unified multimodal representation learning.
We explore integrating text, audio, action-state, and pixelated images (sprites) into a cohesive representation.
We conducted experiments on a PixelBytes Pokemon dataset and an Optimal-Control dataset.
arXiv Detail & Related papers (2024-09-16T09:20:13Z)