Fugu-MT Paper Translation (Abstract): NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Paper abstract: NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

arxiv url: http://arxiv.org/abs/2404.01300v3
Date: Thu, 18 Jul 2024 17:59:48 GMT
Status: Translation complete
System update date: 2024-07-19 21:01:57.104727
Title: NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
Title (reference translation): NeRF-MAE: Masked Autoencoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
Authors: Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus
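Abstract summary: The proposed method shows how to generate effective 3D representations from posed RGB images. The representation is pretrained on the authors' proposed curated posed-RGB data, totaling over 1.8 million images. NeRF-MAE, a self-supervised pretraining approach for NeRFs, scales remarkably well and improves performance on various 3D tasks.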
Reference score (independently computed attention score): 57.617972778377215
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as point clouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.
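Abstract (reference translation): Neural fields excel in computer vision and robotics thanks to their ability to understand the 3D visual world, for example by inferring semantics, geometry, and dynamics. Given the ability of neural fields to densely represent a 3D scene from 2D images, we ask: can their self-supervised pretraining be scaled up, using masked autoencoders, to generate effective 3D representations from posed RGB images? Following the striking success of extending transformers to new data modalities, we use standard 3D Vision Transformers adapted to the unique formulation of NeRFs. We use NeRF's volumetric grid as a dense input to the transformer, in contrast to other 3D representations such as point clouds, where information density can be uneven and the representation is irregular. Because masked autoencoders are difficult to apply to an implicit representation such as NeRF, we choose to extract an explicit representation that canonicalizes scenes across domains by using the camera trajectory for sampling. We achieve our goal by masking random patches of NeRF's radiance and density grid and reconstructing the masked patches with a standard 3D Swin Transformer. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation on our proposed curated posed-RGB data, totaling over 1.8 million images. After pretraining, the encoder is used for effective 3D transfer learning. Our self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various 3D tasks. By using unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on the Front3D and ScanNet datasets, with absolute improvements of over 20% AP50 and 8% AP25 for 3D object detection.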
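
The masking step described in the abstract (random patches removed from NeRF's radiance and density grid, then reconstructed) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `random_patch_mask` and `masked_mse`, the patch size of 4, the 75% mask ratio, the zero-fill of masked voxels, and the masked mean-squared-error objective are all assumptions chosen for illustration, and the 3D Swin Transformer that would produce the reconstruction is omitted.

```python
import numpy as np


def random_patch_mask(grid, patch=4, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping cubic patches.

    grid: array of shape (C, D, H, W), e.g. C = 4 for RGB radiance + density.
    patch, mask_ratio: illustrative values, not taken from the paper.
    Returns (masked_grid, patch_mask, voxel_mask); mask entries are True
    where the input was hidden.
    """
    _, d, h, w = grid.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("grid dimensions must be divisible by the patch size")

    patches_per_axis = (d // patch, h // patch, w // patch)
    num_patches = int(np.prod(patches_per_axis))
    num_masked = int(round(mask_ratio * num_patches))

    # Choose which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    patch_mask = flat.reshape(patches_per_axis)

    # Expand the per-patch mask to voxel resolution.
    voxel_mask = (
        patch_mask.repeat(patch, axis=0)
        .repeat(patch, axis=1)
        .repeat(patch, axis=2)
    )

    # Assumption: masked voxels are simply zero-filled in every channel.
    masked = grid.copy()
    masked[:, voxel_mask] = 0
    return masked, patch_mask, voxel_mask


def masked_mse(pred, target, voxel_mask):
    """Mean squared error over masked voxels only (assumed objective)."""
    diff = (pred - target)[:, voxel_mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    # Toy 4-channel grid: 3 radiance channels + 1 density channel.
    grid = np.random.default_rng(1).random((4, 16, 16, 16), dtype=np.float32)
    masked, patch_mask, voxel_mask = random_patch_mask(grid)
    print("masked patches:", int(patch_mask.sum()), "of", patch_mask.size)
    # A trivial "prediction" of all zeros, standing in for a real decoder.
    print("loss:", masked_mse(np.zeros_like(grid), grid, voxel_mask))
```

In a real pipeline the encoder would see only the visible content and a decoder would predict the hidden patches; the sketch only shows how a dense, regular grid makes patch-level masking straightforward compared with irregular inputs such as point clouds.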


Related papers