PD-TPE: Parallel Decoder with Text-guided Position Encoding for 3D Visual Grounding
- URL: http://arxiv.org/abs/2407.14491v1
- Date: Fri, 19 Jul 2024 17:44:33 GMT
- Title: PD-TPE: Parallel Decoder with Text-guided Position Encoding for 3D Visual Grounding
- Authors: Chenshu Hou, Liang Peng, Xiaopei Wu, Wenxiao Wang, Xiaofei He
- Abstract summary: 3D visual grounding aims to locate the target object mentioned by free-form natural language descriptions in 3D point cloud scenes.
We propose PD-TPE, a visual-language model with a double-branch decoder.
We surpass the state-of-the-art on two widely adopted 3D visual grounding datasets.
- Score: 20.422852022310945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding aims to locate the target object mentioned by a free-form natural language description in a 3D point cloud scene. Most previous work requires the encoder-decoder to simultaneously align the target object's attribute information and its relational information with the surrounding environment across modalities. This disperses the queries' attention, potentially leading to an excessive focus on points irrelevant to the input language description. To alleviate these issues, we propose PD-TPE, a visual-language model with a double-branch decoder. The two branches perform proposal feature decoding and surrounding layout awareness in parallel. Since their attention maps are not influenced by each other, the queries focus on tokens relevant to each branch's specific objective. In particular, we design a novel Text-guided Position Encoding method that differs between the two branches. In the main branch, the prior relies on the relative positions between tokens and predicted 3D boxes, directing the model to pay more attention to tokens near the object; in the surrounding branch, it is guided by the similarity between visual and text features, so that the queries attend to tokens that can provide effective layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and NR3D, by 1.8% and 2.2%, respectively. Code will be made publicly available.
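To make the two priors concrete, here is a minimal PyTorch sketch of how each branch's attention bias could be computed. All function names, tensor shapes, and the exact bias forms are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def main_branch_bias(token_xyz, box_centers):
    """Main-branch prior (assumed form): tokens closer to a query's
    currently predicted 3D box center get a larger additive attention bias.

    token_xyz:   (N, 3) coordinates of visual tokens
    box_centers: (Q, 3) predicted box centers, one per query
    returns:     (Q, N) additive bias for the attention logits
    """
    dist = torch.cdist(box_centers, token_xyz)   # (Q, N) Euclidean distances
    return -dist                                 # nearer tokens -> higher logits

def surrounding_branch_bias(vis_feats, text_feats):
    """Surrounding-branch prior (assumed form): tokens whose features are
    similar to the language description get a larger bias, steering queries
    toward tokens that carry useful layout information.

    vis_feats:  (N, D) visual token features
    text_feats: (T, D) text token features
    returns:    (1, N) additive bias, broadcast over queries
    """
    sim = F.normalize(vis_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T  # (N, T)
    return sim.max(dim=-1).values.unsqueeze(0)   # best-matching text token per visual token
```

In use, each bias would be added to that branch's cross-attention logits before the softmax, so the two attention maps are shaped independently, matching the parallel-decoding idea described in the abstract.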
Related papers
- Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive-octree structure is developed that stores semantics and represents an object's occupancy at a resolution adapted to its shape.
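As a rough sketch of the adaptive-octree idea (the structure and stopping rules below are assumptions, not taken from the paper), a cell might stop subdividing once its points are semantically pure, so resolution follows the object's shape:

```python
import numpy as np

class OctreeNode:
    """Hypothetical adaptive-octree node: stores a semantic label and an
    occupancy flag, subdividing only while its points disagree semantically."""

    def __init__(self, points, labels, center, half, max_depth=8):
        self.center, self.half = center, half
        self.occupied = len(points) > 0
        self.label = labels[0] if self.occupied else None
        self.children = []
        # Stop when empty, semantically pure, or at maximum resolution.
        if not self.occupied or max_depth == 0 or len(set(labels)) == 1:
            return
        self.label = None                                 # mixed cell: defer to children
        for i in range(8):                                # visit all eight octants
            offset = np.array([(i >> 2) & 1, (i >> 1) & 1, i & 1]) * 2 - 1
            child_center = center + offset * (half / 2)
            mask = np.all(np.abs(points - child_center) <= half / 2, axis=1)
            self.children.append(OctreeNode(
                points[mask], [l for l, m in zip(labels, mask) if m],
                child_center, half / 2, max_depth - 1))
```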
arXiv Detail & Related papers (2024-11-25T10:14:10Z)
- PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation [32.04698431036215]
In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework.
We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations.
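How two pretext tasks might share one pre-training objective can be sketched as a weighted sum of losses; the loss forms and weighting below are assumptions, not PointCG's actual formulation:

```python
import torch

def pretraining_loss(mpm_pred, mpm_target, img_pred, img_target, alpha=0.5):
    """Hypothetical joint objective: masked point modeling (MPM) reconstructs
    masked point patches, while a 3D-to-2D head predicts rendered image views;
    the two losses are weighted and summed."""
    loss_mpm = torch.nn.functional.mse_loss(mpm_pred, mpm_target)  # masked-point reconstruction
    loss_gen = torch.nn.functional.mse_loss(img_pred, img_target)  # 3D-to-2D generation target
    return alpha * loss_mpm + (1 - alpha) * loss_gen
```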
arXiv Detail & Related papers (2024-11-09T02:38:29Z)
- LidaRefer: Context-aware Outdoor 3D Visual Grounding for Autonomous Driving [1.0589208420411014]
3D visual grounding aims to locate objects or regions within 3D scenes guided by natural language descriptions.
Large-scale outdoor LiDAR scenes are dominated by background points and contain limited foreground information.
LidaRefer is a context-aware 3D VG framework for outdoor scenes.
arXiv Detail & Related papers (2024-11-07T01:12:01Z)
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a query sentence from a 3D point cloud.
We propose a novel referring 3D segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
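Supervision from a binary mask alone can be pictured as a per-point binary classification loss; this is a generic sketch, not the paper's full objective (which the abstract does not detail):

```python
import torch

def binary_mask_loss(point_logits, gt_mask):
    """Per-point binary cross-entropy against a 0/1 mask of the referred
    object -- the only label this style of training assumes is available.

    point_logits: (N,) raw scores, one per point
    gt_mask:      (N,) float tensor of 0s and 1s
    """
    return torch.nn.functional.binary_cross_entropy_with_logits(point_logits, gt_mask)
```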
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object.
Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components.
We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z)
- Bi-directional Contextual Attention for 3D Dense Captioning [38.022425401910894]
3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene.
Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object.
We introduce BiCA, a transformer encoder-decoder pipeline that engages in 3D dense captioning for each object with Bi-directional Contextual Attention.
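"Bi-directional" plausibly means objects attend to scene context while context also attends to objects; below is a generic cross-attention sketch under that assumption (module and tensor names are hypothetical):

```python
import torch.nn as nn

class BiDirectionalContext(nn.Module):
    """Hypothetical bi-directional contextual attention: object queries and
    context features each cross-attend to the other, then both are returned."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.obj_to_ctx = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ctx_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj, ctx):                  # obj: (B, K, dim), ctx: (B, N, dim)
        obj2, _ = self.obj_to_ctx(obj, ctx, ctx)  # objects gather context
        ctx2, _ = self.ctx_to_obj(ctx, obj, obj)  # context gathers objects
        return obj + obj2, ctx + ctx2             # residual connections
```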
arXiv Detail & Related papers (2024-08-13T06:25:54Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
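The summary gives no details, but relative-position awareness is commonly built from pairwise offsets between object centers; here is a generic sketch of such features (not the paper's actual encoding):

```python
import torch

def relative_position_features(centers):
    """Hypothetical pairwise relative-position encoding: for every object
    pair, encode the offset vector and Euclidean distance between centers.

    centers: (K, 3) object center coordinates
    returns: (K, K, 4) per-pair [dx, dy, dz, distance]
    """
    diff = centers[:, None, :] - centers[None, :, :]   # (K, K, 3) pairwise offsets
    dist = diff.norm(dim=-1, keepdim=True)             # (K, K, 1) pairwise distances
    return torch.cat([diff, dist], dim=-1)
```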
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
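The contrastive half of such alignment is often an InfoNCE-style loss pulling the sentence embedding toward the referred region; the sketch below is a generic form under that assumption, not CLUM's exact loss:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(text_feat, pos_feats, neg_feats, tau=0.07):
    """Generic InfoNCE-style alignment: the pooled sentence feature should be
    more similar to features of the referred region (positives) than to
    features of other regions (negatives).

    text_feat: (D,)   pooled language feature
    pos_feats: (P, D) visual features from the referred object
    neg_feats: (M, D) visual features from non-referred regions
    """
    t = F.normalize(text_feat, dim=-1)
    pos = F.normalize(pos_feats, dim=-1) @ t / tau   # (P,) similarities
    neg = F.normalize(neg_feats, dim=-1) @ t / tau   # (M,) similarities
    all_logits = torch.cat([pos, neg])
    # Positives should dominate the partition function.
    return torch.logsumexp(all_logits, 0) - torch.logsumexp(pos, 0)
```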
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- MLCVNet: Multi-Level Context VoteNet for 3D Object Detection [51.45832752942529]
We propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet.
We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels.
Our method is an effective way to promote detection accuracy, achieving new state-of-the-art detection performance on challenging 3D object detection datasets.
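A common way to realize such a context module is self-attention over proposal (vote-cluster) features, so each proposal sees the rest of the scene; a minimal sketch along those lines follows (the paper's exact modules differ in detail):

```python
import torch.nn as nn

class ClusterContextModule(nn.Module):
    """Hypothetical proposal-level context module: vote clusters exchange
    information via self-attention before the classification heads."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cluster_feats):            # (B, K, dim) proposal features
        ctx, _ = self.attn(cluster_feats, cluster_feats, cluster_feats)
        return self.norm(cluster_feats + ctx)    # residual + norm
```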
arXiv Detail & Related papers (2020-04-12T19:10:24Z)