Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
- URL: http://arxiv.org/abs/2503.16707v2
- Date: Fri, 28 Mar 2025 15:55:59 GMT
- Title: Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
- Authors: Jinlong Li, Cristiano Saltori, Fabio Poiesi, Nicu Sebe
- Abstract summary: We propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at: https://github.com/TyroneLi/CUA_O3D.
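The abstract describes weighting the distillation from each foundation model by a learned, deterministic per-teacher uncertainty. The paper's exact loss is not reproduced here; as an illustrative sketch, a classic log-variance weighting (as in multi-task uncertainty weighting) could combine per-teacher distillation terms like this. The function name, shapes, and the log-variance parameterization are assumptions, not the authors' formulation:

```python
import numpy as np

def uncertainty_weighted_distill_loss(student_feats, teacher_feats, log_vars):
    """Aggregate distillation losses from several 2D teachers, each weighted
    by a learned per-teacher log-variance (deterministic uncertainty).

    student_feats : (N, D) 3D-model features projected to a teacher space
    teacher_feats : list of (N, D) arrays, one per foundation model
    log_vars      : (K,) learned log-variances, one per teacher
    """
    total = 0.0
    for g, s in zip(teacher_feats, log_vars):
        mse = np.mean((student_feats - g) ** 2)  # per-teacher alignment error
        total += np.exp(-s) * mse + s            # down-weight uncertain teachers
    return total
```

A teacher assigned a larger log-variance contributes a smaller gradient from its alignment term (scaled by `exp(-s)`), while the `+ s` term discourages the model from inflating all uncertainties to zero out the loss.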
Related papers
- SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection
Multimodal 3D object detection based on deep neural networks has indeed made significant progress.
However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds.
We present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model consisting of a scale-aligned fusion strategy, a 3D-to-2D space alignment module, and a latent cross-modal fusion module.
arXiv Detail & Related papers (2025-04-07T15:15:06Z)
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.
GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-04-01T05:19:50Z)
- FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection
FM-OV3D is a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection.
We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP.
Experiments show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model.
arXiv Detail & Related papers (2023-12-22T06:34:23Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
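The voting-based label fusion described above can be sketched as a per-point majority vote over the predictions of the different 2D models. This is a minimal illustration, not the authors' implementation; the function name and the ignore-label convention are assumptions:

```python
import numpy as np

def fuse_pseudo_labels(per_model_labels, num_classes, ignore_index=-1):
    """Majority-vote fusion of per-point semantic predictions from several
    2D models into a single 3D pseudo label per point.

    per_model_labels : (M, N) int array, label of each of N points from each
                       of M models (ignore_index marks unlabeled points)
    """
    M, N = per_model_labels.shape
    votes = np.zeros((N, num_classes), dtype=np.int64)
    for m in range(M):
        valid = per_model_labels[m] != ignore_index
        votes[np.flatnonzero(valid), per_model_labels[m, valid]] += 1
    fused = votes.argmax(axis=1)
    fused[votes.sum(axis=1) == 0] = ignore_index  # no model labeled this point
    return fused
```

Points on which the models agree accumulate more votes for one class; points left unlabeled by every model keep the ignore label so they are excluded from pseudo-supervised training.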
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Weakly Supervised 3D Open-vocabulary Segmentation
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize the 3D voxelization and 3D convolution network.
We propose a new framework for outdoor LiDAR segmentation, in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore 3D geometric patterns.
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
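As a rough illustration of the cylindrical partition idea, Cartesian LiDAR points can be binned into (rho, phi, z) voxels before 3D convolution. The grid sizes and coordinate ranges below are placeholders, not the paper's settings:

```python
import numpy as np

def cylindrical_voxel_ids(points, grid=(480, 360, 32),
                          rho_range=(0.0, 50.0), z_range=(-4.0, 2.0)):
    """Map Cartesian points (N, 3) to integer cylindrical voxel indices
    (rho, phi, z). Ranges outside the bounds are clipped to the edge cells."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)      # radial distance from the sensor
    phi = np.arctan2(y, x)              # azimuth angle in [-pi, pi]
    # Normalize each coordinate into [0, 1), then scale to the grid resolution.
    r_idx = np.clip((rho - rho_range[0]) / (rho_range[1] - rho_range[0]), 0, 1 - 1e-9)
    p_idx = np.clip((phi + np.pi) / (2 * np.pi), 0, 1 - 1e-9)
    z_idx = np.clip((z - z_range[0]) / (z_range[1] - z_range[0]), 0, 1 - 1e-9)
    idx = np.stack([r_idx, p_idx, z_idx], axis=1) * np.array(grid)
    return idx.astype(np.int64)
```

Unlike a uniform Cartesian grid, this partition keeps cell occupancy more balanced for rotating LiDAR scans, since point density falls off with radial distance.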
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.