Fugu-MT Paper Translation (Summary): Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Paper summary: Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

arxiv url: http://arxiv.org/abs/2510.07316v1
Date: Wed, 08 Oct 2025 17:59:33 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.693283
Title: Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
Title (reference translation): Pixel-Perfect Depth Using Semantics-Prompted Diffusion Transformers
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang
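Abstract summary: Pixel-Perfect Depth is a monocular depth estimation model based on pixel-space diffusion generation. The model achieves the best performance among all published generative models across five benchmarks.
Reference score (independently computed attention measure): 45.701222598522456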
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
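Abstract (reference translation): This paper describes Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation. Current generative depth estimation models fine-tune Stable Diffusion and achieve strong performance. However, they need a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and fine details. Our model addresses this problem by performing diffusion generation directly in pixel space, avoiding VAE-induced artifacts. To overcome the complexity of pixel-space generation, we introduce two new designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, preserving global semantic consistency while enhancing fine-grained visual detail; and 2) a Cascade DiT design that progressively increases the number of tokens to further improve efficiency and accuracy. The model achieves the best performance among all generative models across five benchmarks and significantly outperforms other models in edge-aware point cloud evaluation.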
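
Illustrative note: the abstract says the Cascade DiT design progressively increases the number of tokens, but it does not say how. One common way to vary token count in a patch-based transformer is to change the patch size between stages. The sketch below only shows that arithmetic; the image size, the patch sizes, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def num_tokens(height, width, patch_size):
    """Count tokens when an image is split into non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def cascade_schedule(height, width, patch_sizes):
    """Return the token count at each stage of a coarse-to-fine cascade.

    patch_sizes is a hypothetical per-stage list; larger patches come first,
    so later stages process more tokens.
    """
    return [num_tokens(height, width, p) for p in patch_sizes]


# Hypothetical 512x512 input, two stages: patch size 16, then patch size 8.
counts = cascade_schedule(512, 512, [16, 8])
print(counts)  # [1024, 4096]
```

Under these assumed settings the first stage attends over 1024 tokens and the second over 4096, so the cheaper coarse stage runs first and the finer, more expensive stage runs afterwards.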

Related papers