Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
- URL: http://arxiv.org/abs/2510.08316v1
- Date: Thu, 09 Oct 2025 15:01:26 GMT
- Title: Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
- Authors: Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen
- Abstract summary: Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. We further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps.
- Score: 45.19482892758984
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
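The abstract does not spell out the CMAT objective, but a minimal sketch of what jointly optimizing reconstruction, affinity, and diversity could look like is given below. The function name, loss weights, and tensor shapes are illustrative assumptions, not the authors' implementation; the affinity term follows the common pattern of matching pairwise feature similarities between the 3D student and the lifted 2D teacher.

```python
import torch
import torch.nn.functional as F

def cmat_style_loss(f3d, f2d, recon, points, w_aff=1.0, w_div=0.1):
    """Illustrative joint objective in the spirit of CMAT (assumed, not
    the authors' code).

    f3d:    (N, D) per-point features from the 3D encoder
    f2d:    (N, D) lifted 2D VFM features for the same points (teacher)
    recon:  (N, 3) reconstructed point coordinates
    points: (N, 3) input point coordinates
    """
    f3d_n = F.normalize(f3d, dim=-1)
    f2d_n = F.normalize(f2d, dim=-1)

    # Reconstruction: keep geometric information in the encoder.
    loss_recon = F.mse_loss(recon, points)

    # Affinity transfer: make pairwise 3D feature similarities match
    # those of the lifted 2D teacher features.
    loss_aff = F.mse_loss(f3d_n @ f3d_n.t(), f2d_n @ f2d_n.t())

    # Diversity: penalize off-diagonal feature covariance so the
    # representation does not collapse onto a few directions.
    z = f3d_n - f3d_n.mean(dim=0, keepdim=True)
    cov = (z.t() @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    loss_div = off_diag.pow(2).sum() / cov.shape[0]

    return loss_recon + w_aff * loss_aff + w_div * loss_div
```

In practice the relative weights of the three terms would be hyperparameters tuned during pre-training.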
Related papers
- SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z) - BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis [33.53327976669034]
We revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. We introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features.
arXiv Detail & Related papers (2025-03-16T15:13:11Z) - DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting [0.0]
We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations. We evaluate on synthetic and real-world indoor datasets, demonstrating improved performance over comparable NeRF-based pipelines.
arXiv Detail & Related papers (2024-12-14T21:26:44Z) - GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z) - Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z) - Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z) - CUS3D :CLIP-based Unsupervised 3D Segmentation via Object-level Denoise [9.12768731317489]
We propose a novel distillation learning framework named CUS3D.
An object-level denoising projection module is designed to screen out the "noise" and ensure more accurate 3D features.
Based on the obtained features, a multimodal distillation learning module is designed to align the 3D feature with CLIP semantic feature space.
arXiv Detail & Related papers (2024-09-21T02:17:35Z) - Augmented Efficiency: Reducing Memory Footprint and Accelerating Inference for 3D Semantic Segmentation through Hybrid Vision [9.96433151449016]
This paper introduces a novel approach to 3D semantic segmentation, distinguished by its hybrid blend of 2D and 3D computer vision techniques.
We conduct 2D semantic segmentation on RGB images linked to 3D point clouds and extend the results to 3D using an extrusion technique for specific class labels (a generic sketch of this 2D-to-3D lifting step appears after this list).
This model serves as the current state-of-the-art 3D semantic segmentation model on the KITTI-360 dataset.
arXiv Detail & Related papers (2024-07-23T00:04:10Z) - DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations.
We leverage the strong semantic prior within a 3D generative model to train a semantic decoder.
Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
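Several entries above, such as the hybrid 2D/3D approach and the CLIP-based distillation pipeline, hinge on the same primitive as the main paper's lifted 2D semantics: transferring per-pixel 2D predictions onto 3D points. The sketch below shows a generic version of that lifting step, assuming a pinhole camera with known intrinsics `K` and world-to-camera pose `(R, t)`; the function name and shapes are illustrative, not any specific paper's implementation.

```python
import numpy as np

def lift_2d_labels_to_points(points, labels_2d, K, R, t):
    """Assign each 3D point the semantic label of the pixel it projects to.

    A generic sketch of 2D-to-3D label lifting (assumed pinhole camera,
    known pose; not any specific paper's implementation).

    points:    (N, 3) points in world coordinates
    labels_2d: (H, W) integer semantic labels from a 2D segmenter
    K:         (3, 3) camera intrinsics
    R, t:      (3, 3) rotation and (3,) translation, world -> camera
    Returns (N,) labels, with -1 for points that project outside the
    image or lie behind the camera.
    """
    H, W = labels_2d.shape
    cam = points @ R.T + t                      # world -> camera coordinates
    uvw = cam @ K.T                             # pinhole projection
    depth = uvw[:, 2]
    safe = np.where(depth > 1e-9, depth, 1.0)   # avoid division by zero
    u = np.round(uvw[:, 0] / safe).astype(int)  # pixel column
    v = np.round(uvw[:, 1] / safe).astype(int)  # pixel row
    valid = (depth > 1e-9) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    labels = np.full(points.shape[0], -1, dtype=np.int64)
    labels[valid] = labels_2d[v[valid], u[valid]]
    return labels
```

Points that project behind the camera or outside the image keep the label -1; in a multi-view setting these gaps are typically filled in from other viewpoints.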