Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
- URL: http://arxiv.org/abs/2507.23134v1
- Date: Wed, 30 Jul 2025 22:26:56 GMT
- Title: Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
- Authors: Sanghun Jung, Jingjing Zheng, Ke Zhang, Nan Qiao, Albert Y. C. Chen, Lu Xia, Chi Liu, Yuyin Sun, Xiao Zeng, Hsiang-Wei Huang, Byron Boots, Min Sun, Cheng-Hao Kuo,
- Abstract summary: We propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together.<n>We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals.<n>Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
- Score: 29.535042727523372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
Related papers
- Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection [45.68105299990119]
Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets.<n>We propose a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD.
arXiv Detail & Related papers (2025-03-10T17:55:22Z) - Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation [19.2297264550686]
Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods.
We introduce Zero-Shot Dual-Path Integration Framework that equally values the contributions of both 3D and 2D modalities.
Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data.
arXiv Detail & Related papers (2024-08-16T07:52:00Z) - Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance [49.14140194332482]
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance within 3D scenes.
Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task.
arXiv Detail & Related papers (2023-12-17T10:07:03Z) - SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z) - Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly
Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z) - 3D-QueryIS: A Query-based Framework for 3D Instance Segmentation [74.6998931386331]
Previous methods for 3D instance segmentation often maintain inter-task dependencies and the tendency towards a lack of robustness.
We propose a novel query-based method, termed as 3D-QueryIS, which is detector-free, semantic segmentation-free, and cluster-free.
Our 3D-QueryIS is free from the accumulated errors caused by the inter-task dependencies.
arXiv Detail & Related papers (2022-11-17T07:04:53Z) - CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates some high-quality 3D proposals by leveraging the class-aware local group strategy on the object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z) - ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object
Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z) - 3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation [26.169985423367393]
3D-MPA is a method for instance segmentation on 3D point clouds.
We learn proposal features from grouped point features that voted for the same object center.
A graph convolutional network introduces inter-proposal relations.
arXiv Detail & Related papers (2020-03-30T23:28:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.