IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation
- URL: http://arxiv.org/abs/2501.04995v1
- Date: Thu, 09 Jan 2025 06:20:00 GMT
- Title: IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation
- Authors: Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Danni Yang, Xiaoshuai Sun,
- Abstract summary: 3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. Existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN) to enhance the model's reasoning capabilities.
- Score: 29.714642487981358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU on the 3D-RES and 3D-GRES tasks, respectively.
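The abstract describes the architecture in enough detail to sketch its two modules at a high level. The snippet below is a minimal, hypothetical PyTorch sketch based only on that description: it injects projected multi-view image features into per-point 3D features (the MSE idea) and derives a task-driven prompt from the interaction between the expression and the scene to guide query decoding (the PAD idea). All module names, tensor shapes, and the precomputed point-to-pixel mapping are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the MSE and PAD ideas described in the abstract.
# Shapes, names, and the point-to-pixel lookup are assumed, not from the paper.
import torch
import torch.nn as nn


class MultiViewSemanticEmbedding(nn.Module):
    """Inject multi-view 2D image features into per-point 3D features (MSE idea)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, point_feats, view_feats, point2pixel):
        # point_feats: (N, C) per-point 3D features
        # view_feats:  (V, P, C) 2D features, P pixels per view
        # point2pixel: (N, 2) long, assumed precomputed (view index, pixel index)
        img_feats = view_feats[point2pixel[:, 0], point2pixel[:, 1]]   # (N, C)
        return self.fuse(torch.cat([point_feats, img_feats], dim=-1))  # (N, C)


class PromptAwareDecoderLayer(nn.Module):
    """Guide query decoding with a prompt derived from expression-scene interaction (PAD idea)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.expr_to_scene = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, scene_feats, expr_feats):
        # queries:     (B, Q, C) instance queries
        # scene_feats: (B, N, C) fused point features
        # expr_feats:  (B, T, C) expression token features
        # 1) The expression attends to the scene, producing a task-driven prompt.
        prompt, _ = self.expr_to_scene(expr_feats, scene_feats, scene_feats)
        # 2) Queries attend to the prompt together with the scene, so decoding
        #    receives top-down, task-specific guidance instead of treating all
        #    queries equally.
        context = torch.cat([prompt, scene_feats], dim=1)
        out, _ = self.query_attn(queries, context, context)
        return out
```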
Related papers
- DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering [106.96097136553105]
3D Question Answering (3D QA) requires the model to understand its situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation.
Existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images.
We propose a Dual-vision Scene Perception Network (DSPNet) to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA.
arXiv Detail & Related papers (2025-03-05T05:13:53Z)
- BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence [11.91274849875519]
We introduce a novel image-centric 3D perception model, BIP3D, to overcome the limitations of point-centric methods. We leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
arXiv Detail & Related papers (2024-11-22T11:35:42Z)
- PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition [8.15444057380305]
We focus on exploiting the large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition.
We propose the Prompt-Enhanced View Aggregation Network (PEVA-Net) to address both zero- and few-shot 3D shape recognition simultaneously.
arXiv Detail & Related papers (2024-04-30T00:16:59Z)
- Weakly Supervised Monocular 3D Detection with a Single-View Image [58.57978772009438]
Monocular 3D detection aims for precise 3D object localization from a single-view image.
We propose SKD-WM3D, a weakly supervised monocular 3D detection framework.
We show that SKD-WM3D clearly surpasses the state-of-the-art and is even on par with many fully supervised methods.
arXiv Detail & Related papers (2024-02-29T13:26:47Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modification (a rough sketch of this alignment follows this entry).
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
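For context on the distillation objective summarized in the PointMCD entry above, here is a rough, hypothetical sketch of a pair-wise view alignment loss between a frozen image-encoder teacher and a point-encoder student. The descriptor shapes and the cosine-based formulation are assumptions for illustration, not necessarily the paper's exact objective.

```python
# Rough sketch of a pair-wise cross-modal distillation loss (assumed form).
import torch
import torch.nn.functional as F


def cross_modal_distillation_loss(point_desc: torch.Tensor,
                                  view_desc: torch.Tensor) -> torch.Tensor:
    # point_desc: (B, V, C) per-view descriptors from the point encoder (student)
    # view_desc:  (B, V, C) per-view descriptors from the pretrained image encoder (teacher)
    point_desc = F.normalize(point_desc, dim=-1)
    view_desc = F.normalize(view_desc, dim=-1)
    # Align each student descriptor with the matching teacher descriptor
    # by maximizing their cosine similarity.
    return (1.0 - (point_desc * view_desc).sum(dim=-1)).mean()
```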