GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
- URL: http://arxiv.org/abs/2404.00931v1
- Date: Mon, 1 Apr 2024 05:19:50 GMT
- Title: GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
- Authors: Yunsong Wang, Hanlin Chen, Gim Hee Lee
- Abstract summary: Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
- Score: 50.68719394443926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However, the generalizability of existing methods is constrained by their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate geometry-aware features using a cost volume, and propose a Multi-view Joint Fusion module that aggregates multi-view features through a cross-view attention mechanism, which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably, GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation, eliminates the need for ground truth semantic labels or depth priors, and effectively generalizes across scenes and datasets without fine-tuning.
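The abstract describes the Multi-view Joint Fusion module only at a high level, so the following is a minimal sketch of the blending idea it names: cross-view attention over per-view features yields view-specific weights that are shared when blending colors and open-vocabulary features. The shapes, the single-head attention, and the random matrices standing in for learned projections are all illustrative assumptions, not the authors' implementation; in the paper the per-view inputs would come from the geometry-aware cost volume.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blend_views(view_feats, view_colors, view_lang_feats):
    """view_feats: (V, D) geometry-aware features sampled for one 3D point;
    view_colors: (V, 3) RGB values from the source views;
    view_lang_feats: (V, C) open-vocabulary features from the source views."""
    V, D = view_feats.shape
    # Single-head cross-view attention; random matrices stand in for
    # learned projections (an assumption, not the paper's parameters).
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = view_feats @ Wq, view_feats @ Wk, view_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (V, V) view-to-view attention
    fused = attn @ v                                # (V, D) cross-view-fused features
    # A scalar head per view; softmax over views gives blending weights
    # shared by both colors and open-vocabulary features.
    w_head = rng.standard_normal(D) / np.sqrt(D)
    weights = softmax(fused @ w_head)               # (V,) view-specific weights
    return weights @ view_colors, weights @ view_lang_feats, weights

color, lang, w = blend_views(rng.standard_normal((4, 32)),
                             rng.random((4, 3)),
                             rng.standard_normal((4, 512)))
print(w.round(3), color.round(3))
```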
Related papers
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space.
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z)
- O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation [9.431926560072412]
We propose O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field (a toy sketch of such a field follows this entry).
Experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes.
arXiv Detail & Related papers (2024-04-10T08:54:43Z)
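As flagged in the O2V-mapping entry above, here is a toy sketch of what a voxel-based open-vocabulary field can look like: a feature grid stands in for fused language features, and a text embedding is scored against the feature looked up at a 3D point. The grid resolution, nearest-voxel lookup, and random features are assumptions for illustration; the actual method fuses real language and geometric features online.

```python
import numpy as np

rng = np.random.default_rng(3)
G, C = 16, 512                                    # grid resolution and feature dim
voxel_feats = rng.standard_normal((G, G, G, C))   # placeholder language features

def query_field(p):
    """Nearest-voxel lookup of the language feature at point p in [0, 1)^3.
    (A real system would use interpolation and online updates.)"""
    idx = np.minimum((p * G).astype(int), G - 1)
    return voxel_feats[idx[0], idx[1], idx[2]]

def open_vocab_score(p, text_emb):
    """Cosine similarity between the point's language feature and a text
    embedding, e.g. for open-vocabulary object localization."""
    f = query_field(p)
    return f @ text_emb / (np.linalg.norm(f) * np.linalg.norm(text_emb))

print(round(open_vocab_score(np.array([0.2, 0.5, 0.9]),
                             rng.standard_normal(C)), 4))
```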
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z)
- UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation [46.998093729036334]
We propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D.
To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module.
To facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose to use hierarchical 3D caption pairs.
arXiv Detail & Related papers (2024-01-21T04:13:58Z)
- Open-NeRF: Towards Open Vocabulary NeRF Decomposition [14.759265492381509]
We present Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF).
Open-NeRF leverages large-scale, off-the-shelf segmentation models such as the Segment Anything Model (SAM).
Our experimental results show that the proposed Open-NeRF outperforms state-of-the-art methods such as LERF and FFD in open-vocabulary scenarios.
arXiv Detail & Related papers (2023-10-25T05:43:14Z)
- GNeSF: Generalizable Neural Semantic Fields [48.49860868061573]
We introduce a generalizable 3D segmentation framework based on implicit representation.
We propose a novel soft voting mechanism to aggregate the 2D semantic information from different views for each 3D point (a toy sketch follows this entry).
Our approach can even outperform existing strong supervision-based approaches with only 2D annotations.
arXiv Detail & Related papers (2023-10-24T10:40:51Z)
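The soft voting mechanism mentioned in the GNeSF entry above can be pictured as a confidence-weighted average of per-view class distributions for each 3D point; the sketch below is an assumed toy version, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_vote(view_logits, view_scores):
    """view_logits: (V, K) 2D semantic predictions projected onto one 3D point;
    view_scores: (V,) per-view confidence (e.g. from visibility or geometry).
    Returns a (K,) class distribution for the 3D point."""
    weights = softmax(view_scores)            # soft vote weights over views
    probs = softmax(view_logits, axis=-1)     # per-view class distributions
    return weights @ probs                    # weighted average = soft voting

rng = np.random.default_rng(1)
dist = soft_vote(rng.standard_normal((5, 20)), rng.standard_normal(5))
print(dist.argmax(), dist.sum().round(6))     # predicted class; sums to 1
```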
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF); a toy sketch of the distillation objective follows this entry.
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
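To make the distillation idea in the last entry concrete, here is a minimal sketch under stated assumptions: a tiny MLP stands in for the feature head of a neural radiance field, and a cosine loss pulls its output at 3D sample points toward frozen CLIP/DINO features of the corresponding pixels, so no manual segmentation labels enter the objective. The architecture, loss form, and all parameters are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
D_HIDDEN, C = 64, 512
W1 = rng.standard_normal((3, D_HIDDEN)) * 0.1   # toy field parameters
W2 = rng.standard_normal((D_HIDDEN, C)) * 0.1

def field(x):
    """Tiny MLP standing in for the NeRF feature head; x: (N, 3) points."""
    return np.tanh(x @ W1) @ W2                 # (N, C) predicted features

def distill_loss(x, target):
    """Cosine-style distillation: match predicted features to the frozen
    CLIP/DINO features of the corresponding pixels. target: (N, C)."""
    pred = field(x)
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - (pred * target).sum(axis=-1).mean()

x = rng.standard_normal((8, 3))                 # sampled 3D points
target = rng.standard_normal((8, C))            # placeholder frozen features
print(round(distill_loss(x, target), 4))
```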