Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation
- URL: http://arxiv.org/abs/2512.19088v1
- Date: Mon, 22 Dec 2025 06:57:42 GMT
- Title: Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation
- Authors: Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian
- Abstract summary: We propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances.
- Score: 36.41046448860009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at https://github.com/ndkhanh360/BoxOVIS.
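The abstract above describes generating 3D instance masks from RGB images under the guidance of a 2D open-vocabulary detector. As a rough illustration of how a 2D box can guide 3D mask generation, the sketch below back-projects the pixels inside a detected box into camera-frame 3D points using a depth map and pinhole intrinsics. The function name, array shapes, and the depth-validity check are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def box_to_3d_points(depth, box, K):
    """Back-project pixels inside a 2D box into camera-frame 3D points.

    depth: (H, W) depth map in meters; box: (x1, y1, x2, y2) pixel bounds;
    K: (3, 3) pinhole camera intrinsics.
    """
    x1, y1, x2, y2 = box
    # Pixel grid covering the box interior.
    us, vs = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
    zs = depth[vs, us]
    valid = zs > 0  # skip pixels with missing depth
    us, vs, zs = us[valid], vs[valid], zs[valid]
    # Standard pinhole back-projection: x = (u - cx) * z / fx, etc.
    xs = (us - K[0, 2]) * zs / K[0, 0]
    ys = (vs - K[1, 2]) * zs / K[1, 1]
    return np.stack([xs, ys, zs], axis=-1)  # (N, 3) points
```

In a full pipeline, the resulting points would still need filtering (e.g. removing background depth inside the box) before being matched to the scene point cloud.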
Related papers
- Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
- CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation [13.871856894814005]
We propose to cut semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. We also propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal.
arXiv Detail & Related papers (2024-11-25T12:11:27Z)
- Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data [57.53523870705433]
We propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det.
OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes.
It employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors.
arXiv Detail & Related papers (2024-11-23T21:37:21Z)
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration. An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation [91.40798599544136]
We propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D. It effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector.
arXiv Detail & Related papers (2024-06-04T17:59:31Z)
- Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance [49.14140194332482]
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance within 3D scenes.
Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task.
arXiv Detail & Related papers (2023-12-17T10:07:03Z)
- SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
- OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation [32.508069732371105]
OpenIns3D is a new 3D-input-only framework for 3D open-vocabulary scene understanding.
It achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks.
arXiv Detail & Related papers (2023-09-01T17:59:56Z)
- OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z)
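The OpenMask3D entry above mentions aggregating per-mask features via multi-view fusion of CLIP-based image embeddings. The toy sketch below shows that fusion step in isolation: average a mask's per-view embeddings and score text queries by cosine similarity. A real pipeline would obtain these vectors from CLIP image and text encoders; the arrays here are placeholders, and all names are assumptions made for illustration.

```python
import numpy as np

def score_queries(per_view_feats, text_feats):
    """Fuse a 3D mask's multi-view image features and score text queries.

    per_view_feats: (V, D) one image embedding per view of the mask.
    text_feats: (Q, D) one embedding per text query.
    Returns (Q,) cosine similarities between the fused feature and each query.
    """
    fused = per_view_feats.mean(axis=0)          # multi-view fusion by averaging
    fused = fused / np.linalg.norm(fused)        # L2-normalize for cosine scoring
    text = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return text @ fused
```

The highest-scoring query (or a score threshold) would then decide which text label, if any, the mask is retrieved under.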
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.