OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
- URL: http://arxiv.org/abs/2512.03532v1
- Date: Wed, 03 Dec 2025 07:51:03 GMT
- Title: OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
- Authors: Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu,
- Abstract summary: We introduce OpenTrack3D, a generalizable and accurate framework for open-vocabulary 3D instance segmentation.<n>Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals.<n>Our pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available.
- Score: 5.073373064582996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
Related papers
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation [81.0871900167463]
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation.<n>Given a textureless object, we render normal and point maps from predefined viewpoints.<n>We accept simple 2D prompts - clicks or boxes - to guide part selection.<n>The predicted masks are back-projected to the object and aggregated across views.
arXiv Detail & Related papers (2025-08-19T17:58:51Z) - SAB3R: Semantic-Augmented Backbone in 3D Reconstruction [19.236494823612507]
We introduce a new task, Map and Locate, which unifies the objectives of open-vocabulary segmentation and 3D reconstruction.<n>Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries.<n>This task serves as a critical step toward real-world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization.
arXiv Detail & Related papers (2025-06-02T18:00:04Z) - Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic geometric and-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
arXiv Detail & Related papers (2024-04-02T17:59:10Z) - Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance [49.14140194332482]
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance within 3D scenes.
Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task.
arXiv Detail & Related papers (2023-12-17T10:07:03Z) - ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic
Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z) - OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z) - Flattening-Net: Deep Regular 2D Representation for 3D Point Cloud
Analysis [66.49788145564004]
We present an unsupervised deep neural architecture called Flattening-Net to represent irregular 3D point clouds of arbitrary geometry and topology.
Our methods perform favorably against the current state-of-the-art competitors.
arXiv Detail & Related papers (2022-12-17T15:05:25Z) - Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds
of Large Scenes with Learned Virtual View Visibility [17.929307870456416]
We present a novel framework for mesh reconstruction from unstructured point clouds.
We take advantage of the learned visibility of the 3D points in the virtual views and traditional graph-cut based mesh generation.
arXiv Detail & Related papers (2021-08-18T20:28:16Z) - S3CNet: A Sparse Semantic Scene Completion Network for LiDAR Point
Clouds [0.16799377888527683]
We present S3CNet, a sparse convolution based neural network that predicts the semantically completed scene from a single, unified LiDAR point cloud.
We show that our proposed method outperforms all counterparts on the 3D task, achieving state-of-the art results on the Semantic KITTI benchmark.
arXiv Detail & Related papers (2020-12-16T20:14:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.