ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
GPT and Prototype Guidance
- URL: http://arxiv.org/abs/2303.16894v4
- Date: Tue, 5 Dec 2023 05:34:18 GMT
- Title: ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
GPT and Prototype Guidance
- Authors: Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao,
Xuelong Li
- Abstract summary: We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
- Score: 48.748738590964216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding 3D scenes from multi-view inputs has been proven to alleviate
the view discrepancy issue in 3D visual grounding. However, existing methods
normally neglect the view cues embedded in the text modality and fail to weigh
the relative importance of different views. In this paper, we propose
ViewRefer, a multi-view framework for 3D visual grounding exploring how to
grasp the view knowledge from both text and 3D modalities. For the text branch,
ViewRefer leverages the diverse linguistic knowledge of large-scale language
models, e.g., GPT, to expand a single grounding text to multiple
geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer
fusion module with inter-view attention is introduced to boost the interaction
of objects across views. On top of that, we further present a set of learnable
multi-view prototypes, which memorize scene-agnostic knowledge for different
views, and enhance the framework from two perspectives: a view-guided attention
module for more robust text features, and a view-guided scoring strategy during
the final prediction. With our designed paradigm, ViewRefer achieves superior
performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%,
and +1.35% on Sr3D, Nr3D, and ScanRefer. Code is released at
https://github.com/Ivan-Tang-3D/ViewRefer3D.
Related papers
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
Existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries.
We propose textbf3D-VLA, a weakly supervised approach for textbf3D visual grounding based on textbfVisual textbfLinguistic textbfAlignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z) - Mono3DVG: 3D Visual Grounding in Monocular Images [12.191320182791483]
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information.
We build a large-scale dataset, Mono3DRefer, which contains 3D object targets with corresponding geometric text descriptions.
We propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings.
arXiv Detail & Related papers (2023-12-13T09:49:59Z) - Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z) - Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z) - MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views are modeled simultaneously and aggregated together.
arXiv Detail & Related papers (2022-04-05T12:59:43Z) - Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER)
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after the transformation.
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.