A Unified Framework for 3D Point Cloud Visual Grounding
- URL: http://arxiv.org/abs/2308.11887v2
- Date: Mon, 20 Nov 2023 08:57:58 GMT
- Title: A Unified Framework for 3D Point Cloud Visual Grounding
- Authors: Haojia Lin, Yongdong Luo, Xiawu Zheng, Lijiang Li, Fei Chao, Taisong
Jin, Donghao Luo, Yan Wang, Liujuan Cao, Rongrong Ji
- Abstract summary: This paper takes the initial step of integrating 3DREC and 3DRES into a unified framework, termed 3DRefTR.
Its key idea is to build upon a mature 3DREC model and leverage the readily available query embeddings and visual tokens from that model to construct a dedicated mask branch.
This design enables 3DRefTR to achieve strong 3DRES and 3DREC capabilities with only 6% additional latency compared to the original 3DREC model.
- Score: 60.75319271082741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thanks to its precise spatial referencing, 3D point cloud visual grounding is
essential for deep understanding and dynamic interaction in 3D environments,
encompassing 3D Referring Expression Comprehension (3DREC) and Segmentation
(3DRES). We argue that 3DREC and 3DRES should be unified in one framework,
which is also a natural progression in the community: 3DREC helps 3DRES locate
the referent, while 3DRES in turn facilitates 3DREC via more fine-grained
language-visual alignment. To achieve this, this paper takes the initial step of
integrating 3DREC and 3DRES into a unified framework, termed
3D Referring Transformer (3DRefTR). Its key idea is to build upon a mature
3DREC model and leverage the readily available query embeddings and visual
tokens from that model to construct a dedicated mask branch. Concretely, we
propose the Superpoint Mask Branch, which serves a dual purpose: i) by
harnessing the inherent association between superpoints and the point cloud, it
eliminates the heavy computational overhead of upsampling high-resolution
visual features; ii) by leveraging heterogeneous CPU-GPU parallelism, the CPU
produces superpoints while the GPU is occupied generating visual and language
tokens, equivalently accomplishing the upsampling computation.
This design enables 3DRefTR to achieve strong 3DRES and 3DREC capabilities
with only 6% additional latency compared to the original 3DREC model.
Empirical evaluations affirm the superiority of 3DRefTR.
Specifically, on the ScanRefer dataset, 3DRefTR surpasses the state-of-the-art
3DRES method by 12.43% in mIoU and improves upon the SOTA 3DREC method by 0.6%
in Acc@0.25IoU. Code and models will be released soon.
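
To make the superpoint idea in point i) concrete, the sketch below shows, in PyTorch, one plausible way a mask branch could pool low-resolution visual tokens into superpoint features and broadcast query-superpoint similarities back to every raw point. All names, shapes, and the choice of mean pooling are illustrative assumptions, not the released 3DRefTR implementation.

```python
# Minimal sketch of a superpoint-based mask branch (illustrative assumptions,
# not the authors' released code).
import torch

def superpoint_mask_logits(query_emb, visual_tokens,
                           token_to_superpoint, point_to_superpoint):
    """
    query_emb:           (Q, D) query embeddings reused from the 3DREC decoder
    visual_tokens:       (T, D) low-resolution visual tokens from the 3DREC backbone
    token_to_superpoint: (T,)   long tensor, superpoint id of each visual token
    point_to_superpoint: (N,)   long tensor, superpoint id of each raw point
    Returns per-point mask logits of shape (Q, N).
    """
    S = int(torch.max(token_to_superpoint.max(), point_to_superpoint.max())) + 1

    # Mean-pool visual tokens that fall into the same superpoint; this replaces
    # the expensive upsampling of high-resolution visual features.
    sp_feat = torch.zeros(S, visual_tokens.size(1), device=visual_tokens.device)
    counts = torch.zeros(S, 1, device=visual_tokens.device)
    sp_feat.index_add_(0, token_to_superpoint, visual_tokens)
    counts.index_add_(0, token_to_superpoint,
                      torch.ones(visual_tokens.size(0), 1, device=visual_tokens.device))
    sp_feat = sp_feat / counts.clamp(min=1)

    # Per-superpoint mask logits: similarity between queries and superpoint features.
    sp_logits = query_emb @ sp_feat.t()            # (Q, S)

    # Broadcasting superpoint logits to the points they contain yields a
    # full-resolution segmentation mask essentially for free.
    return sp_logits[:, point_to_superpoint]       # (Q, N)
```

A toy call such as superpoint_mask_logits(torch.randn(4, 288), torch.randn(256, 288), torch.randint(0, 800, (256,)), torch.randint(0, 800, (50000,))) returns a (4, 50000) logit map, one candidate mask per query.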
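
Point ii), the heterogeneous CPU-GPU parallelism, can similarly be pictured as overlapping CPU-side superpoint oversegmentation with the GPU forward pass. In this hedged sketch, rec_model and generate_superpoints are hypothetical stand-ins for the 3DREC backbone and the superpoint algorithm, not components defined by the paper.

```python
# Illustrative sketch of overlapping CPU-side superpoint generation with the GPU
# forward pass; rec_model and generate_superpoints are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

import torch

def forward_with_cpu_gpu_overlap(points, text, rec_model, generate_superpoints):
    """points: (N, 3) CPU float tensor; text: a tokenized referring expression."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Launch superpoint oversegmentation (e.g. a graph-based method) on the CPU.
        sp_future = pool.submit(generate_superpoints, points.numpy())

        # Meanwhile the GPU (assuming a CUDA device is available) produces query
        # embeddings, visual tokens, and the indices of the raw points each
        # visual token was sampled from.
        query_emb, visual_tokens, token_point_idx = rec_model(points.cuda(), text)

        # By the time the GPU finishes, the superpoint labels are typically ready,
        # so the mask branch adds almost no extra latency.
        point_to_superpoint = torch.as_tensor(sp_future.result(),
                                              device=query_emb.device).long()
        token_to_superpoint = point_to_superpoint[token_point_idx]

    return query_emb, visual_tokens, token_to_superpoint, point_to_superpoint
```

The returned tensors match the inputs expected by a superpoint-pooling mask branch, so the two sketches compose into an end-to-end (though purely illustrative) 3DRES head.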
Related papers
- Repeat and Concatenate: 2D to 3D Image Translation with 3D to 3D Generative Modeling [14.341099905684844]
This paper investigates a 2D to 3D image translation method with a straightforward technique, enabling correlated 2D X-ray to 3D CT-like reconstruction.
We observe that existing approaches, which integrate information across multiple 2D views in the latent space, lose valuable signal information during latent encoding. Instead, we simply repeat and concatenate the 2D views into higher-channel 3D volumes and approach the 3D reconstruction challenge as a straightforward 3D to 3D generative modeling problem.
This method enables the reconstructed 3D volume to retain valuable information from the 2D inputs, which are passed between channel states in a Swin U
arXiv Detail & Related papers (2024-06-26T15:18:20Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - Unified Scene Representation and Reconstruction for 3D Large Language Models [40.693839066536505]
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.
We introduce Uni3DR2, which extracts 3D geometric and semantically aware representation features via frozen 2D foundation models.
Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs.
arXiv Detail & Related papers (2024-04-19T17:58:04Z) - Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications.
We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks.
In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z) - Regulating Intermediate 3D Features for Vision-Centric Autonomous
Driving [26.03800936700545]
We propose to regulate intermediate dense 3D features with the help of volume rendering.
Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features.
arXiv Detail & Related papers (2023-12-19T04:09:05Z) - CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates some high-quality 3D proposals by leveraging the class-aware local group strategy on the object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z) - Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic
Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a 3D cylinder partition and a 3D cylinder convolution based framework, termed as Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds.
arXiv Detail & Related papers (2020-08-04T13:56:19Z) - Appearance-Preserving 3D Convolution for Video-based Person
Re-identification [61.677153482995564]
We propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel.
It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds.
arXiv Detail & Related papers (2020-07-16T16:21:34Z)