XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
- URL: http://arxiv.org/abs/2411.13243v1
- Date: Wed, 20 Nov 2024 12:02:12 GMT
- Title: XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
- Authors: Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, Jiwen Lu
- Abstract summary: We propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D.
We integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks.
The generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings.
- Score: 72.12250272218792
- Abstract: Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.
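The abstract describes three steps: conditioning a pre-trained 2D denoising UNet on a 3D global token, predicting 2D masks, and aligning mask-pooled 3D features with the vision-language space. Below is a minimal, self-contained sketch of that flow; every module name, shape, and the random projection stand-ins are illustrative assumptions rather than the released implementation (see the linked repository for the real code).

```python
# Hedged sketch of an XMask3D-style cross-modal mask reasoning forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoisingUNet(nn.Module):
    """Stand-in for a pre-trained 2D denoising UNet that accepts
    conditioning tokens (text embeddings plus a 3D global token)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, 3, padding=1)
        self.cond_proj = nn.Linear(dim, dim)
        self.head = nn.Conv2d(dim, dim, 1)

    def forward(self, img, cond_tokens):
        h = self.conv(img)                          # (B, D, H, W)
        c = self.cond_proj(cond_tokens).mean(1)     # (B, D) pooled condition
        h = h + c[:, :, None, None]                 # inject implicit condition
        return self.head(h)                         # dense geometry-aware features

def mask_pool(point_feat, point_masks):
    """Average per-point 3D features into one embedding per mask.
    point_feat: (N, D); point_masks: (M, N) soft membership weights."""
    w = point_masks / point_masks.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return w @ point_feat                           # (M, D)

B, N, D, M = 1, 2048, 64, 8
img = torch.randn(B, 3, 32, 32)                 # one posed RGB view
point_feat = torch.randn(N, D)                  # per-point 3D backbone features
global_3d = point_feat.mean(0).expand(B, 1, D)  # implicit 3D condition token
text_emb = torch.randn(B, 4, D)                 # CLIP-style category embeddings

unet = ToyDenoisingUNet(dim=D)
dense = unet(img, torch.cat([text_emb, global_3d], dim=1))  # (B, D, 32, 32)

mask_head = nn.Conv2d(D, M, 1)                  # hypothetical mask head
mask_logits = mask_head(dense)                  # (B, M, 32, 32) 2D masks
point_masks = torch.rand(M, N)                  # 2D-to-3D projection stand-in
mask_emb_3d = mask_pool(point_feat, point_masks)            # (M, D)

# Mask-level alignment: pull each 3D mask embedding toward a target
# vision-language embedding (random here; CLIP mask features in practice).
vl_target = torch.randn(M, D)
align_loss = 1 - F.cosine_similarity(mask_emb_3d, vl_target, dim=-1).mean()
print(mask_logits.shape, float(align_loss))
```

In the real system the UNet would be a frozen diffusion denoiser and `point_masks` would come from projecting the generated 2D masks onto the point cloud; both are random stand-ins here so the sketch runs end to end.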
Related papers
- Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking [6.599971425078935]
Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks.
This challenge arises from their unsupervised merging approach, where dense 2D masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision (a toy version of this lifting step is sketched after this entry).
We propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames.
arXiv Detail & Related papers (2024-11-25T08:26:31Z)
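The over-segmentation issue above stems from the generic lift-and-merge step: project every point into each frame, look up which 2D mask it lands in, and merge by votes. Below is a hedged sketch of that baseline step under a simple pinhole camera model; `project_points`, `lift_masks`, and the voting rule are illustrative assumptions, not Any3DIS's tracking module.

```python
# Generic "lift 2D masks to 3D" step that mask-tracking pipelines build on.
import torch

def project_points(points, K, w2c):
    """Project (N, 3) world points with intrinsics K (3, 3) and a (4, 4)
    world-to-camera matrix; returns pixel coordinates and a validity mask."""
    homo = torch.cat([points, torch.ones(len(points), 1)], dim=1)  # (N, 4)
    cam = (w2c @ homo.T).T[:, :3]                                  # (N, 3)
    in_front = cam[:, 2] > 1e-6
    uv = (K @ cam.T).T                                             # (N, 3)
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)
    return uv, in_front

def lift_masks(points, frames):
    """Vote a mask ID per point by projecting into every frame.
    frames: list of (K, w2c, mask_img), where mask_img holds an integer
    mask ID per pixel and -1 means background."""
    votes = [dict() for _ in range(len(points))]
    for K, w2c, mask_img in frames:
        H, W = mask_img.shape
        uv, ok = project_points(points, K, w2c)
        u = uv[:, 0].round().long()
        v = uv[:, 1].round().long()
        ok = ok & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        for i in ok.nonzero().flatten().tolist():
            mid = int(mask_img[v[i], u[i]])
            if mid >= 0:
                votes[i][mid] = votes[i].get(mid, 0) + 1
    # Each point takes its most-voted mask ID; unseen points stay at -1.
    return torch.tensor([max(d, key=d.get) if d else -1 for d in votes])

# Tiny smoke test: one frame, identity pose, points in front of the camera.
pts = torch.tensor([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]])
K = torch.tensor([[10.0, 0.0, 8.0], [0.0, 10.0, 8.0], [0.0, 0.0, 1.0]])
mask = torch.full((16, 16), -1, dtype=torch.long)
mask[8, 8] = 0                           # first point projects to pixel (8, 8)
print(lift_masks(pts, [(K, torch.eye(4), mask)]))  # tensor([ 0, -1])
```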
- Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation [47.08813064337934]
This paper presents MaskField, which enables efficient 3D open-vocabulary segmentation with neural fields from a novel perspective.
MaskField decomposes the distillation of mask and semantic features from foundation models by formulating a mask feature field and queries.
Our experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence.
arXiv Detail & Related papers (2024-07-01T12:07:26Z)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding [54.981605111365056]
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding.
Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing.
arXiv Detail & Related papers (2024-06-04T07:42:33Z)
- Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic and geometric-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
arXiv Detail & Related papers (2024-04-02T17:59:10Z)
- MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation [11.123421412837336]
Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories.
Recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames.
We propose a novel metric, view consensus rate, to enhance the utilization of multi-view observations; one plausible formalization is sketched after this entry.
arXiv Detail & Related papers (2024-01-15T14:56:15Z)
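As a rough illustration of the idea, one plausible reading of a view consensus rate between two 3D mask fragments is the fraction of views observing both fragments in which they fall into the same 2D mask. The sketch below implements that reading; the paper's exact definition may differ.

```python
# Hedged formalization of a view consensus rate between two mask fragments.
import torch

def view_consensus_rate(obs_a, obs_b):
    """obs_a, obs_b: (V,) long tensors giving, for each of V views, the ID
    of the 2D mask that fragment a / b projects into (-1 = not visible)."""
    both_visible = (obs_a >= 0) & (obs_b >= 0)
    if not both_visible.any():
        return 0.0
    agree = (obs_a == obs_b) & both_visible   # same 2D mask in that view
    return (agree.sum() / both_visible.sum()).item()

# Two fragments are seen together in 4 of 5 views; they land in the same
# 2D mask in 3 of those 4 views, so the consensus rate is 0.75 and a
# clustering step would likely merge them into one 3D instance.
a = torch.tensor([2, 2, 5, 7, -1])
b = torch.tensor([2, 2, 5, 4, 3])
print(view_consensus_rate(a, b))  # 0.75
```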
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF); a minimal feature-distillation sketch follows this entry.
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
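The core of such distillation is rendering a per-ray feature from the field and matching it to a frozen 2D teacher's pixel feature. The snippet below is a minimal sketch of that single step, with random stand-ins for the volume-rendering weights and the CLIP/DINO teacher features; it is not the paper's training code.

```python
# Hedged sketch of distilling 2D teacher features into a radiance field.
import torch
import torch.nn.functional as F

def composite(weights, values):
    """Alpha-composite per-sample values along each ray.
    weights: (R, S) volume-rendering weights; values: (R, S, D) -> (R, D)."""
    return (weights[..., None] * values).sum(dim=1)

R, S, D = 1024, 32, 64                       # rays, samples per ray, feat dim
weights = torch.softmax(torch.randn(R, S), dim=1)        # stand-in weights
feat_samples = torch.randn(R, S, D, requires_grad=True)  # field's feature head
teacher = torch.randn(R, D)                  # frozen CLIP/DINO pixel features

rendered = composite(weights, feat_samples)  # (R, D) rendered ray features
distill_loss = 1 - F.cosine_similarity(rendered, teacher, dim=-1).mean()
distill_loss.backward()                      # gradients reach the feature field
print(distill_loss.item())
```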
- Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models [18.315856283440386]
Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding.
Their potential to enrich 3D scene representation learning remains largely untapped due to the domain gap.
We propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models.
arXiv Detail & Related papers (2023-05-15T16:36:56Z)
- Segment Anything in 3D with Radiance Fields [83.14130158502493]
This paper generalizes the Segment Anything Model (SAM) to segment 3D objects.
We refer to the proposed solution as SA3D, short for Segment Anything in 3D.
We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds.
arXiv Detail & Related papers (2023-04-24T17:57:15Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training; a generic joint-masking sketch follows this entry.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
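For context, the joint-masking idea can be boiled down to masking image patches and point groups of a paired sample and encoding only the visible tokens. The sketch below shows that generic MAE-style step; the shared encoder and all shapes are assumptions, and Joint-MAE's actual cross-modal decoders and reconstruction targets are omitted.

```python
# Hedged sketch of joint 2D-3D masking for MAE-style pre-training.
import torch
import torch.nn as nn

def random_mask(num_tokens, keep_ratio):
    """Split token indices into a visible set and a masked set."""
    perm = torch.randperm(num_tokens)
    keep = int(num_tokens * keep_ratio)
    return perm[:keep], perm[keep:]

P, G, D = 196, 64, 128               # image patches, point groups, embed dim
img_tokens = torch.randn(P, D)       # patch embeddings of the 2D projection
pc_tokens = torch.randn(G, D)        # grouped point-cloud embeddings

keep_i, mask_i = random_mask(P, keep_ratio=0.25)
keep_p, mask_p = random_mask(G, keep_ratio=0.25)

# Shared encoder over the visible tokens of both modalities; a lightweight
# decoder (omitted) would reconstruct the masked patches and point groups.
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
visible = torch.cat([img_tokens[keep_i], pc_tokens[keep_p]], dim=0)
latent = encoder(visible.unsqueeze(0))   # (1, visible_2d + visible_3d, D)
print(latent.shape)
```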
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.