Fugu-MT Paper Translation (Abstract): Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Paper abstract: Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

arxiv url: http://arxiv.org/abs/2512.23705v1
Date: Mon, 29 Dec 2025 18:59:24 GMT
Status: Translation complete
Last updated in system: 2025-12-30 22:37:30.622804
Title: Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Title (reference translation): Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Authors: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
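Abstract summary: TransPhy3D is a synthetic video corpus of transparent scenes assembled and rendered with Blender/Cycles. We learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency.
Reference score (independently computed attention metric): 16.61765374101053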
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
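Abstract (reference translation): Refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth estimation, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, which suggests they have internalized the optical rules. TransPhy3D is a synthetic video corpus of transparent and reflective scenes, consisting of 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. RGB + depth + normals are rendered with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training, we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for input videos of arbitrary length. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth raises success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. These results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.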
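
Illustrative sketch (not from the paper): the abstract mentions two mechanisms, concatenating RGB and noisy depth latents and adapting a frozen backbone with lightweight LoRA adapters. The NumPy snippet below is a minimal toy version of both ideas, under assumptions that the abstract does not confirm: the latents are joined along the channel axis, the adapter wraps a single linear projection, and the class name `LoRALinear`, the shapes, and the `rank`/`alpha` values are all hypothetical. It is not DKT's implementation.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical helper for illustration; only A and B would be trained.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                              # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T                          # frozen path
        update = self.scale * (x @ self.A.T) @ self.B.T   # low-rank path
        return base + update


# Assumed toy latent shapes: (frames, tokens, channels).
rgb_latent = np.random.default_rng(1).normal(size=(2, 3, 4))
noisy_depth_latent = np.random.default_rng(2).normal(size=(2, 3, 4))

# Assumption: concatenation along the channel axis.
tokens = np.concatenate([rgb_latent, noisy_depth_latent], axis=-1)

W = np.random.default_rng(3).normal(size=(4, 8))          # stand-in for a pretrained projection
layer = LoRALinear(W)
out = layer(tokens)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen projection exactly; training would then change only `A` and `B`, which is what keeps this kind of adapter lightweight.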

