Fugu-MT Paper Translation (Abstract): DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Paper overview: DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

arxiv url: http://arxiv.org/abs/2406.12095v1
Date: Mon, 17 Jun 2024 21:15:13 GMT
Status: Translation complete
Last updated in system: 2024-06-19 23:47:35.819635
Title: DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
Title (reference translation): DistillNeRF: Recognizing 3D Scenes from Single-Viewpoint Images by Distilling Neural Fields and Foundation Model Features
Authors: Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven L. Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus
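Abstract summary: DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving. It predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs. It is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images.
Reference score (independently computed attention score): 65.8738034806085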
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets for training, thereby helping our model to learn 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes dataset demonstrate that DistillNeRF significantly outperforms existing comparable self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
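Abstract (reference translation): We propose DistillNeRF, a self-supervised learning framework that addresses the difficulty of understanding 3D environments from limited 2D observations in autonomous driving. The proposed method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and it is trained in a self-supervised manner with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets for training, which helps the model learn 3D geometry from sparse, non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models such as CLIP or DINOv2, which enables various downstream tasks without costly 3D human annotations. To leverage these two insights, we introduce a new model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes dataset show that DistillNeRF significantly outperforms existing comparable self-supervised methods in scene reconstruction, novel view synthesis, and depth estimation, and that it enables competitive zero-shot 3D semantic occupancy prediction as well as open-world scene understanding through distilled foundation model features. Demos and code will be made available at https://distillnerf.github.io/.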

Related papers