ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
GPT and Prototype Guidance
- URL: http://arxiv.org/abs/2303.16894v4
- Date: Tue, 5 Dec 2023 05:34:18 GMT
- Title: ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
GPT and Prototype Guidance
- Authors: Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao,
Xuelong Li
- Abstract summary: We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
- Score: 48.748738590964216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding 3D scenes from multi-view inputs has been proven to alleviate
the view discrepancy issue in 3D visual grounding. However, existing methods
normally neglect the view cues embedded in the text modality and fail to weigh
the relative importance of different views. In this paper, we propose
ViewRefer, a multi-view framework for 3D visual grounding exploring how to
grasp the view knowledge from both text and 3D modalities. For the text branch,
ViewRefer leverages the diverse linguistic knowledge of large-scale language
models, e.g., GPT, to expand a single grounding text to multiple
geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer
fusion module with inter-view attention is introduced to boost the interaction
of objects across views. On top of that, we further present a set of learnable
multi-view prototypes, which memorize scene-agnostic knowledge for different
views, and enhance the framework from two perspectives: a view-guided attention
module for more robust text features, and a view-guided scoring strategy during
the final prediction. With our designed paradigm, ViewRefer achieves superior
performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%,
and +1.35% on Sr3D, Nr3D, and ScanRefer. Code is released at
https://github.com/Ivan-Tang-3D/ViewRefer3D.
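To make the 3D-side components described in the abstract more concrete, below is a minimal PyTorch sketch of how an inter-view attention fusion module and a prototype-based view-guided scoring step could be wired together. All class and parameter names here (InterViewFusion, ViewGuidedScoring, num_views, dim) are illustrative assumptions rather than the released implementation; see the repository above for the authors' actual code.

    # Illustrative sketch only -- names and shapes are assumptions, not ViewRefer's released code.
    import torch
    import torch.nn as nn

    class InterViewFusion(nn.Module):
        """Transformer fusion: object tokens from all views attend to each other."""
        def __init__(self, dim=256, num_heads=8, num_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, obj_feats):
            # obj_feats: (batch, views, objects, dim)
            b, v, o, d = obj_feats.shape
            tokens = obj_feats.reshape(b, v * o, d)   # one sequence over all views
            fused = self.encoder(tokens)              # inter-view self-attention
            return fused.reshape(b, v, o, d)

    class ViewGuidedScoring(nn.Module):
        """Learnable per-view prototypes weight each view's grounding scores."""
        def __init__(self, dim=256, num_views=4):
            super().__init__()
            self.prototypes = nn.Parameter(torch.randn(num_views, dim))

        def forward(self, view_scores, text_feat):
            # view_scores: (batch, views, objects); text_feat: (batch, dim)
            weights = torch.softmax(text_feat @ self.prototypes.t(), dim=-1)
            return (weights.unsqueeze(-1) * view_scores).sum(dim=1)  # (batch, objects)

    # Toy usage: 2 scenes, 4 views, 16 object proposals, 256-dim features.
    feats = torch.randn(2, 4, 16, 256)
    scores = torch.randn(2, 4, 16)
    text = torch.randn(2, 256)
    fused = InterViewFusion()(feats)           # (2, 4, 16, 256)
    final = ViewGuidedScoring()(scores, text)  # (2, 16) per-object scores

The scoring step collapses per-view scores into a single per-object distribution, reflecting the idea that the text should decide how much each view is trusted; the actual prototype, attention, and scoring designs in ViewRefer differ in detail.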
Related papers
- SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887] (arXiv, 2024-10-10)
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198] (arXiv, 2023-12-15)
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models in aligning the semantics between texts and 2D images.
During inference, the learned text-3D correspondence helps ground text queries to the 3D target objects even without 2D images.
- Mono3DVG: 3D Visual Grounding in Monocular Images [12.191320182791483] (arXiv, 2023-12-13)
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information.
We build a large-scale dataset, Mono3DRefer, which contains 3D object targets with corresponding geometric text descriptions.
We propose Mono3DVG-TR, an end-to-end transformer-based network that takes advantage of both the appearance and geometry information in text embeddings.
- Multiview Compressive Coding for 3D Reconstruction [77.95706553743626] (arXiv, 2023-01-19)
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087] (arXiv, 2022-12-27)
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945] (arXiv, 2022-09-13)
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
- Multi-View Transformer for 3D Visual Grounding [64.30493173825234] (arXiv, 2022-04-05)
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated.
This list is automatically generated from the titles and abstracts of the papers on this site.