Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding
- URL: http://arxiv.org/abs/2410.15615v1
- Date: Mon, 21 Oct 2024 03:33:13 GMT
- Title: Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding
- Authors: Yang Liu, Daizong Liu, Wei Hu
- Abstract summary: This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on text descriptions.
Existing methods fall into two categories: top-down and bottom-up methods.
We propose a joint top-down and bottom-up framework, aiming to enhance the performance while improving the efficiency.
- Score: 28.55989894411032
- Abstract: This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on text descriptions. Existing methods fall into two categories: top-down and bottom-up methods. Top-down methods rely on a pre-trained 3D detector to generate and select the best bounding box, resulting in time-consuming processes. Bottom-up methods directly regress object bounding boxes with coarse-grained features, producing worse results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework, aiming to enhance performance while improving efficiency. Specifically, in the first stage, we propose a bottom-up based proposal generation module, which utilizes lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. Then, in the second stage, we introduce a top-down based proposal consolidation module, which utilizes a graph design to effectively aggregate and propagate the query-related object contexts among the generated proposals for further refinement. By jointly training these two modules, we can avoid the inherent drawbacks of the complex proposals in the top-down framework and the coarse proposals in the bottom-up framework. Experimental results on the ScanRefer benchmark show that our framework achieves state-of-the-art performance.
Related papers
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
- GOOD: General Optimization-based Fusion for 3D Object Detection via LiDAR-Camera Object Candidates [10.534984939225014]
3D object detection serves as the core basis of the perception tasks in autonomous driving.
GOOD is a general optimization-based fusion framework that can achieve satisfying detection without training additional models.
Experiments on both the nuScenes and KITTI datasets show that GOOD outperforms PointPillars by 9.1% in mAP.
arXiv Detail & Related papers (2023-03-17T07:05:04Z)
- CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates some high-quality 3D proposals by leveraging the class-aware local group strategy on the object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z)
- ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z)
- NeuralBF: Neural Bilateral Filtering for Top-down Instance Segmentation on Point Clouds [44.258500431460924]
We introduce a method for instance proposal generation for 3D point clouds.
We show that this serves as a critical bottleneck, and propose a method based on iterative bilateral filtering with learned kernels.
arXiv Detail & Related papers (2022-07-20T15:37:32Z)
- MCTS with Refinement for Proposals Selection Games in Scene Understanding [32.92475660892122]
We propose a novel method applicable in many scene understanding problems that adapts the Monte Carlo Tree Search (MCTS) algorithm.
From a generated pool of proposals, our method jointly selects and optimizes proposals that maximize the objective term.
Our method shows high performance on the Matterport3D dataset without introducing hard constraints on room layout configurations.
arXiv Detail & Related papers (2022-07-07T10:15:54Z)
- RBGNet: Ray-based Grouping for 3D Object Detection [104.98776095895641]
We propose the RBGNet framework, a voting-based 3D detector for accurate 3D object detection from point clouds.
We propose a ray-based feature grouping module, which aggregates the point-wise features on object surfaces using a group of determined rays.
Our model achieves state-of-the-art 3D detection performance on ScanNet V2 and SUN RGB-D with remarkable performance gains.
arXiv Detail & Related papers (2022-04-05T14:42:57Z)
- Learning Local Displacements for Point Cloud Completion [93.54286830844134]
We propose a novel approach aimed at object and semantic scene completion from a partial scan represented as a 3D point cloud.
Our architecture relies on three novel layers that are used successively within an encoder-decoder structure.
We evaluate both architectures on object and indoor scene completion tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-03-30T18:31:37Z)
- Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos [58.83440885457272]
We address the problem of temporal sentence localization in videos (TSLV).
Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals.
We propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while improving efficiency.
arXiv Detail & Related papers (2021-09-14T02:02:36Z)