COS3D: Collaborative Open-Vocabulary 3D Segmentation
- URL: http://arxiv.org/abs/2510.20238v1
- Date: Thu, 23 Oct 2025 05:45:15 GMT
- Title: COS3D: Collaborative Open-Vocabulary 3D Segmentation
- Authors: Runsong Zhu, Ka-Hei Hui, Zhengzhe Liu, Qianyi Wu, Weiliang Tang, Shi Qiu, Pheng-Ann Heng, Chi-Wing Fu
- Abstract summary: We present COS3D, a new collaborative prompt-segmentation framework. We first introduce the new concept of collaborative field, comprising an instance field and a language field. During inference, to bridge the distinct characteristics of the two fields, we design an adaptive language-to-instance prompt refinement.
- Score: 86.41533122575981
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, which suffer from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that effectively integrates complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of the collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and the language field through a novel instance-to-language feature mapping and an efficient two-stage training strategy. During inference, to bridge the distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely used benchmarks but also show its high potential for various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at https://github.com/Runsong123/COS3D.
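To make the pipeline concrete, here is a minimal PyTorch sketch of the collaborative-field idea: per-Gaussian instance embeddings, an instance-to-language mapping into CLIP space, and a simplified language-to-instance prompt step. All names, dimensions, and the fixed similarity threshold are illustrative assumptions, not the authors' implementation (the paper's refinement step is adaptive).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeField(nn.Module):
    """Hypothetical collaborative field: each Gaussian carries an instance
    embedding, and a learned mapping lifts it into a language (CLIP) space."""

    def __init__(self, num_gaussians: int, inst_dim: int = 16, lang_dim: int = 512):
        super().__init__()
        self.inst_feat = nn.Parameter(torch.randn(num_gaussians, inst_dim))
        # Instance-to-language feature mapping (assumed MLP form).
        self.inst_to_lang = nn.Sequential(
            nn.Linear(inst_dim, 256), nn.ReLU(), nn.Linear(256, lang_dim)
        )

    def language_field(self) -> torch.Tensor:
        # Language features derived from instance features via the mapping.
        return F.normalize(self.inst_to_lang(self.inst_feat), dim=-1)

def prompt_segment(field: CollaborativeField, text_emb: torch.Tensor,
                   tau: float = 0.7) -> torch.Tensor:
    """Simplified language-to-instance prompting:
    1) score Gaussians against the text prompt in the language field;
    2) take the best match's *instance* embedding as the prompt;
    3) segment by instance-feature similarity."""
    lang = field.language_field()                   # (N, lang_dim)
    scores = lang @ F.normalize(text_emb, dim=-1)   # (N,) text-match scores
    anchor = field.inst_feat[scores.argmax()]       # instance-space prompt
    sim = F.cosine_similarity(field.inst_feat, anchor[None], dim=-1)
    return sim > tau                                # boolean mask over Gaussians

field = CollaborativeField(num_gaussians=10_000)
mask = prompt_segment(field, torch.randn(512))      # e.g. a CLIP text embedding
print(mask.sum().item(), "Gaussians selected")
```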
Related papers
- Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge [45.19482892758984]
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity objectives to yield semantically organized representations (one plausible combination of these losses is sketched below). We further design the Cross-modal Affordance Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps.
arXiv Detail & Related papers (2025-10-09T15:01:26Z)
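The CMAT summary above names three training objectives without their exact forms; the sketch below shows one plausible way to combine reconstruction, affinity, and diversity terms. The specific loss definitions and weights are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cmat_style_loss(feat3d, feat2d, recon, points,
                    w_rec=1.0, w_aff=1.0, w_div=0.1):
    """Hypothetical combination of the three objectives named for CMAT:
    reconstruction, affinity (match 3D feature similarities to lifted 2D
    ones), and diversity (discourage collapsed features)."""
    # Reconstruction: predicted point positions vs. input points.
    l_rec = F.mse_loss(recon, points)
    # Affinity: pairwise 3D similarities should match pairwise 2D similarities.
    a3d = F.normalize(feat3d, dim=-1) @ F.normalize(feat3d, dim=-1).T
    a2d = F.normalize(feat2d, dim=-1) @ F.normalize(feat2d, dim=-1).T
    l_aff = F.mse_loss(a3d, a2d)
    # Diversity: push mean off-diagonal feature similarity toward zero.
    n = feat3d.shape[0]
    off_diag = a3d - torch.eye(n, device=a3d.device)
    l_div = off_diag.abs().mean()
    return w_rec * l_rec + w_aff * l_aff + w_div * l_div

n, d = 256, 64
loss = cmat_style_loss(torch.randn(n, d), torch.randn(n, 512),
                       torch.randn(n, 3), torch.randn(n, 3))
print(float(loss))
```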
- PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum [20.206273757144547]
PGOV3D is a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. We pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex (the two-stage schedule is sketched below).
arXiv Detail & Related papers (2025-06-30T08:13:07Z)
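A minimal sketch of a two-stage partial-to-global curriculum of the kind PGOV3D describes, using generic tensor stand-ins for partial and complete scenes; the toy model, data shapes, and the lowered fine-tuning learning rate are illustrative choices, not the paper's setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_stage(model, loader, optimizer, epochs):
    """One curriculum stage: a standard supervised loop (details assumed)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for points, labels in loader:
            optimizer.zero_grad()
            logits = model(points)              # (B, num_classes)
            loss_fn(logits, labels).backward()
            optimizer.step()

# Hypothetical stand-ins for the two data regimes in the curriculum.
partial_scenes = TensorDataset(torch.randn(64, 1024), torch.randint(0, 20, (64,)))
full_scenes    = TensorDataset(torch.randn(64, 1024), torch.randint(0, 20, (64,)))

model = torch.nn.Sequential(torch.nn.Linear(1024, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 20))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: pre-train on partial scenes (dense semantics, simple geometry).
train_stage(model, DataLoader(partial_scenes, batch_size=8), opt, epochs=2)
# Stage 2: fine-tune on complete scenes (sparser, structurally more complex),
# with a lower learning rate (an assumption here).
for g in opt.param_groups:
    g["lr"] = 1e-4
train_stage(model, DataLoader(full_scenes, batch_size=8), opt, epochs=1)
```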
- PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond [70.95930509071451]
PartField is a feedforward approach for learning part-based 3D features. It is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods.
arXiv Detail & Related papers (2025-04-15T17:58:16Z)
- SJTU: Spatial Judgments in Multimodal Models towards Unified Segmentation through Coordinate Detection [4.930667479611019]
This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection. It presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space (a pipeline stub is sketched below). We demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC.
arXiv Detail & Related papers (2024-12-03T16:53:58Z)
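A hedged stub of the coordinate-detection pipeline SJTU describes: a multimodal model answers a spatial query with box coordinates, which then prompt a segmenter. Both components below are stand-ins; the actual models and prompt formats are not specified here.

```python
import numpy as np

def vlm_predict_box(image: np.ndarray, query: str) -> tuple[int, int, int, int]:
    """Stand-in for a multimodal model that answers a spatial question with
    pixel coordinates (x0, y0, x1, y1). This stub returns a fixed box."""
    h, w = image.shape[:2]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def segment_from_box(image: np.ndarray, box) -> np.ndarray:
    """Stand-in for a promptable segmenter (e.g., a SAM-like model) that
    refines a box prompt into a mask; here we simply fill the box."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = True
    return mask

image = np.zeros((480, 640, 3), dtype=np.uint8)
box = vlm_predict_box(image, "the red mug on the table")
mask = segment_from_box(image, box)
print("mask pixels:", int(mask.sum()))
```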
- Search3D: Hierarchical Open-Vocabulary 3D Segmentation [78.47704793095669]
We introduce Search3D, an approach to construct hierarchical open-vocabulary 3D scene representations. Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm. For systematic evaluation, we contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan.
arXiv Detail & Related papers (2024-09-27T03:44:07Z)
- 3D-GRES: Generalized 3D Referring Expression Segmentation [77.10044505645064]
3D Referring Expression (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description.
Generalized 3D Referring Expression (3D-GRES) extends the capability to segment any number of instances based on natural language instructions.
arXiv Detail & Related papers (2024-07-30T08:59:05Z)
- SegPoint: Segment Any Point Cloud via Large Language Model [62.69797122055389]
We propose a model, called SegPoint, to produce point-wise segmentation masks across a diverse range of tasks.
SegPoint is the first model to address varied segmentation tasks within a single framework.
arXiv Detail & Related papers (2024-07-18T17:58:03Z)
- Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models [20.277479473218513]
We introduce a new task: Zero-Shot 3D Reasoning for searching and localizing parts of objects.
We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands.
We show that Reasoning3D can effectively localize and highlight parts of 3D objects based on implicit textual queries.
arXiv Detail & Related papers (2024-05-29T17:56:07Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for 3D scene understanding when labeled scenes are quite limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited-reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model (a minimal distillation loop is sketched below).
Our method achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet, and Replica datasets.
arXiv Detail & Related papers (2023-09-11T13:41:27Z)
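A minimal sketch of the distillation step described above, assuming surface points and per-pixel teacher features are already given; the toy MLP stands in for the NeRF-style feature field, and the cosine loss is a common but assumed choice for CLIP-space targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureField(nn.Module):
    """Toy stand-in for a 3D semantic feature field: maps a 3D point to a
    vision-language embedding (the real method uses a NeRF-style field)."""
    def __init__(self, lang_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(),
                                 nn.Linear(256, lang_dim))
    def forward(self, xyz):
        return self.mlp(xyz)

field = FeatureField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

# Distillation: field features queried at surface points should match the
# pretrained 2D model's features at the corresponding pixels (assumed given).
surface_pts = torch.randn(1024, 3)        # points hit by training rays
teacher_feat = torch.randn(1024, 512)     # e.g. CLIP-space features per pixel

for _ in range(100):
    opt.zero_grad()
    pred = field(surface_pts)
    loss = 1 - F.cosine_similarity(pred, teacher_feat, dim=-1).mean()
    loss.backward()
    opt.step()
```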
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models, which encode extensive knowledge from image-text pairs, to generate captions for 3D scenes (a toy caption-supervision objective is sketched below).
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
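Caption supervision of this kind is often implemented with an InfoNCE-style objective between scene features and caption embeddings; the sketch below shows that generic pattern, not necessarily Lowis3D's exact formulation.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(scene_feat, caption_emb, temperature=0.07):
    """Hypothetical caption-to-scene contrastive objective: each 3D scene (or
    region) embedding should match the text embedding of its generated caption
    and mismatch other captions in the batch (InfoNCE form)."""
    s = F.normalize(scene_feat, dim=-1)   # (B, D) pooled 3D features
    c = F.normalize(caption_emb, dim=-1)  # (B, D) caption text embeddings
    logits = s @ c.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(s.shape[0])    # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

loss = caption_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
print(float(loss))
```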
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-Vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them (a minimal version of text-embedding classification is sketched below).
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
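The common-semantic-space idea can be illustrated by scoring proposal features against text-encoder embeddings of class names instead of a fixed classifier head; the sketch below uses random placeholders for both inputs.

```python
import torch
import torch.nn.functional as F

def classify_with_text_embeddings(proposal_feat, class_name_emb, temperature=0.01):
    """Open-vocabulary classification: score each proposal (mask or box)
    feature against text-encoder embeddings of the class names, so
    segmentation and detection can share one semantic space. The embeddings
    here are random placeholders for real text-encoder outputs."""
    p = F.normalize(proposal_feat, dim=-1)   # (N, D) per-proposal features
    t = F.normalize(class_name_emb, dim=-1)  # (C, D) one embedding per class
    return (p @ t.T) / temperature           # (N, C) class logits

logits = classify_with_text_embeddings(torch.randn(100, 512), torch.randn(20, 512))
print(logits.argmax(dim=-1)[:5])
```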