OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views
- URL: http://arxiv.org/abs/2404.03650v1
- Date: Thu, 4 Apr 2024 17:59:08 GMT
- Title: OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views
- Authors: Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari
- Abstract summary: We propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF.
Our work shows that using pixel-wise VLM features results in an overall less complex architecture without the need for additional DINO regularization.
For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, the first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of visual-language models. Indeed, point clouds and 3D meshes typically have a lower resolution than images, and the reconstructed 3D scene geometry might not project well onto the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF; however, our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
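As a concrete illustration of the querying step described in the abstract, the sketch below embeds text prompts with CLIP and assigns each pixel of a rendered feature map to its most similar concept. Only the CLIP text-encoder calls are real API (the openai/CLIP package); the rendered feature map is a random placeholder, and the prompt list, image size, and variable names are assumptions for illustration, not OpenNeRF's actual interface.

```python
# Illustrative sketch of text-driven querying against a NeRF feature field.
# Only the CLIP text-encoder calls are real API; the feature map below is a
# random placeholder standing in for OpenNeRF's rendered per-pixel features.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

prompts = ["a chair", "a table", "a sofa", "a plant"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device)).float()
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)       # (C, D)

# Placeholder: in OpenNeRF this map would be rendered from the NeRF's
# feature head for a given (possibly novel) view.
H, W, D = 480, 640, text_emb.shape[-1]
feat_map = torch.randn(H, W, D, device=device)
feat_map = feat_map / feat_map.norm(dim=-1, keepdim=True)

# Cosine similarity per pixel, then assign the most similar concept.
sims = torch.einsum("hwd,cd->hwc", feat_map, text_emb)          # (H, W, C)
segmentation = sims.argmax(dim=-1)                              # (H, W) ids
```

For 3D point cloud segmentation, the same similarity test would be applied to features associated with 3D point locations rather than pixels.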
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images [17.344430840048094]
We present OpenDlign, a novel open-world 3D model using depth-aligned images for robust multimodal alignment.
OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite only fine-tuning 6 million parameters.
arXiv Detail & Related papers (2024-04-25T11:53:36Z)
- PERF: Panoramic Neural Radiance Field from a Single Panorama [109.31072618058043]
PERF is a novel view synthesis framework that trains a panoramic neural radiance field from a single panorama.
We propose a novel collaborative RGBD inpainting method and a progressive inpainting-and-erasing method to lift a 360-degree 2D scene to a 3D scene.
PERF can be widely used in real-world applications, such as panorama-to-3D, text-to-3D, and 3D scene stylization.
arXiv Detail & Related papers (2023-10-25T17:59:01Z)
- Open-NeRF: Towards Open Vocabulary NeRF Decomposition [14.759265492381509]
We present Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF).
Open-NeRF leverages large-scale, off-the-shelf segmentation models like the Segment Anything Model (SAM).
Our experimental results show that the proposed Open-NeRF outperforms state-of-the-art methods such as LERF and FFD in open-vocabulary scenarios.
arXiv Detail & Related papers (2023-10-25T05:43:14Z)
- SegNeRF: 3D Part Segmentation with Neural Radiance Fields [63.12841224024818]
SegNeRF is a neural field representation that integrates a semantic field along with the usual radiance field.
SegNeRF is capable of simultaneously predicting geometry, appearance, and semantic information from posed images, even for unseen objects.
SegNeRF is able to generate an explicit 3D model from a single image of an object taken in the wild, with its corresponding part segmentation.
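To make "a semantic field along with the usual radiance field" concrete, the minimal sketch below adds a semantic-logits head next to the density and color heads of a NeRF-style MLP. The layer widths, omitted positional encoding, and head layout are assumptions for illustration, not SegNeRF's exact architecture.

```python
import torch
import torch.nn as nn

class RadianceSemanticField(nn.Module):
    """NeRF-style MLP with an extra semantic head (illustrative sketch,
    not SegNeRF's exact architecture)."""
    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)             # sigma
        self.color_head = nn.Linear(hidden, 3)               # RGB
        self.semantic_head = nn.Linear(hidden, num_classes)  # part logits

    def forward(self, xyz: torch.Tensor):
        h = self.trunk(xyz)
        sigma = torch.relu(self.density_head(h))
        rgb = torch.sigmoid(self.color_head(h))
        logits = self.semantic_head(h)   # composited along rays like color,
        return sigma, rgb, logits        # then supervised with 2D labels
```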
arXiv Detail & Related papers (2022-11-21T07:16:03Z)
- PeRFception: Perception using Radiance Fields [72.99583614735545]
We create the first large-scale implicit representation datasets for perception tasks, called PeRFception.
It shows a significant memory compression rate (96.4%) from the original dataset, while containing both 2D and 3D information in a unified form.
We construct the classification and segmentation models that directly take as input this implicit format and also propose a novel augmentation technique to avoid overfitting on backgrounds of images.
arXiv Detail & Related papers (2022-08-24T13:32:46Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
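For reference, the volume rendering step mentioned here is typically the standard NeRF quadrature, sketched below; this is the generic formulation rather than this paper's specific implementation.

```python
import torch

def volume_render(sigmas, colors, deltas):
    """Standard NeRF quadrature: composite per-sample densities and colors
    along a ray into one pixel color.
    sigmas: (N,) densities, colors: (N, 3), deltas: (N,) segment lengths.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)           # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)   # transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])       # shift so T_1 = 1
    weights = alphas * trans
    return (weights[:, None] * colors).sum(dim=0)        # (3,) RGB
```

Called with, e.g., `volume_render(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.01))`, it returns one RGB pixel; the same weights can also composite semantic logits or VLM features, which is how feature fields such as OpenNeRF render per-pixel features.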
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- pixelNeRF: Neural Radiance Fields from One or Few Images [20.607712035278315]
pixelNeRF is a learning framework that predicts a continuous neural scene representation conditioned on one or few input images.
We conduct experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects.
In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction.
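pixelNeRF's key conditioning idea is to project each 3D query point into the input view, bilinearly sample a pixel-aligned CNN feature there, and feed that feature to the NeRF MLP alongside the point. The sketch below shows just the projection-and-sampling step under assumed shapes and names; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_pixel_features(feat_map, K, xyz_cam):
    """Project 3D points (camera coordinates) into the image plane and
    bilinearly sample pixel-aligned features (illustrative sketch).
    feat_map: (1, C, H, W) CNN features of the input view.
    K: (3, 3) camera intrinsics.  xyz_cam: (N, 3) query points.
    """
    _, _, H, W = feat_map.shape
    uvw = (K @ xyz_cam.T).T                     # (N, 3) homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample
    # (out-of-view points receive zero padding by default).
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)               # (1, 1, N, 2)
    feats = F.grid_sample(feat_map, grid, align_corners=True)
    return feats.view(feat_map.shape[1], -1).T  # (N, C) per-point features
```

The sampled per-point features are then concatenated with the (positionally encoded) query point and viewing direction as input to the NeRF MLP.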
arXiv Detail & Related papers (2020-12-03T18:59:54Z)