Related papers: Segment Any 3D Object with Language

Segment Any 3D Object with Language

URL: http://arxiv.org/abs/2404.02157v1
Date: Tue, 2 Apr 2024 17:59:10 GMT
Title: Segment Any 3D Object with Language
Authors: Seungjun Lee, Yuyang Zhao, Gim Hee Lee,
Abstract summary: We introduce Segment any 3D Object with LanguagE (SOLE), a semantic geometric and-aware visual-language learning framework with strong generalizability. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder. Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
Score: 58.471327490684295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely on only annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate poor generalizability to novel categories by generating class-agnostic masks or projecting generalized masks from 2D to 3D, but disregard semantic or geometry information, leading to sub-optimal performance. Instead, generating generalizable but semantic-related masks directly from 3D point clouds would result in superior outcomes. In this paper, we introduce Segment any 3D Object with LanguagE (SOLE), which is a semantic and geometric-aware visual-language learning framework with strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder. In addition, to align the 3D segmentation model with various language instructions and enhance the mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to the fully-supervised counterpart despite the absence of class annotations in the training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions.

Related papers

CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D [0.0]
3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation.<n>Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs)<n>We leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks.
arXiv Detail & Related papers (2025-09-29T09:43:00Z)
PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum [20.206273757144547]
PGOV3D is a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation.<n>We pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry.<n>In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex.
arXiv Detail & Related papers (2025-06-30T08:13:07Z)
MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting [0.0]
We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations. We evaluate on synthetic and real-world indoor datasets, demonstrating improved performance over comparable NeRF-based pipelines.
arXiv Detail & Related papers (2024-12-14T21:26:44Z)
XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation [72.12250272218792]
We propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. We integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks. The generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings.
arXiv Detail & Related papers (2024-11-20T12:02:12Z)
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D aims to explore the comprehensive multi-modal feature interaction and understanding. RefMask3D outperforms previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation [46.998093729036334]
We propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module. To facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs.
arXiv Detail & Related papers (2024-01-21T04:13:58Z)
3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation [40.49322398635262]
We propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features. We propose two novel loss functions: object-level distillation loss and voxel-level distillation loss.
arXiv Detail & Related papers (2024-01-04T18:39:32Z)
SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach. Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations. Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task. Our approach involves making initial predictions of 2D semantic masks using different large vision models. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
Fine-Grained 3D Shape Classification with Hierarchical Part-View Attentions [70.0171362989609]
We propose a novel fine-grained 3D shape classification method named FG3D-Net to capture the fine-grained local details of 3D shapes from multiple rendered views. Our results under the fine-grained 3D shape dataset show that our method outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2020-05-26T06:53:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.