Multi-View Transformer for 3D Visual Grounding
- URL: http://arxiv.org/abs/2204.02174v1
- Date: Tue, 5 Apr 2022 12:59:43 GMT
- Title: Multi-View Transformer for 3D Visual Grounding
- Authors: Shijia Huang, Yilun Chen, Jiaya Jia, Liwei Wang
- Abstract summary: We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together.
- Score: 64.30493173825234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The 3D visual grounding task aims to ground a natural language description to
the targeted object in a 3D scene, which is usually represented in 3D point
clouds. Previous works studied visual grounding under specific views. The
vision-language correspondence learned in this way can easily fail once the
view changes. In this paper, we propose a Multi-View Transformer (MVT) for 3D
visual grounding. We project the 3D scene to a multi-view space, in which the
position information of the 3D scene under different views is modeled
simultaneously and aggregated together. The multi-view space enables the
network to learn a more robust multi-modal representation for 3D visual
grounding and eliminates the dependence on specific views. Extensive
experiments show that our approach significantly outperforms all
state-of-the-art methods. Specifically, on Nr3D and Sr3D datasets, our method
outperforms the best competitor by 11.2% and 7.1% and even surpasses recent
work with extra 2D assistance by 5.9% and 6.6%. Our code is available at
https://github.com/sega-hsj/MVT-3DVG.
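The core idea in the abstract lends itself to a short sketch. Below is a minimal, illustrative PyTorch sketch of view-robust position modeling: object coordinates are projected into several views, position information is encoded per view with shared weights, and the per-view features are aggregated. The rotation around the vertical axis, the number of views, the averaging step, and all module and parameter names are assumptions made for this sketch, not the paper's exact implementation (see the linked repository for that).

```python
import math

import torch
import torch.nn as nn


class MultiViewPositionFusion(nn.Module):
    """Illustrative sketch: rotate object centers into several views, encode
    position per view with shared weights, and aggregate across views."""

    def __init__(self, feat_dim=256, num_views=4):
        super().__init__()
        self.num_views = num_views
        # Shared position encoder, applied identically under every view.
        self.pos_encoder = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Shared object encoder (stands in for the multi-modal transformer).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obj_feats, obj_centers):
        # obj_feats: (B, K, D) per-object features; obj_centers: (B, K, 3) box centers.
        view_outputs = []
        for n in range(self.num_views):
            theta = 2.0 * math.pi * n / self.num_views
            cos, sin = math.cos(theta), math.sin(theta)
            # Rotate centers around the vertical (z) axis to obtain view n.
            rot = obj_centers.new_tensor(
                [[cos, -sin, 0.0], [sin, cos, 0.0], [0.0, 0.0, 1.0]]
            )
            tokens = obj_feats + self.pos_encoder(obj_centers @ rot.T)
            view_outputs.append(self.encoder(tokens))
        # Aggregate over views so the output no longer depends on any single view.
        return torch.stack(view_outputs, dim=0).mean(dim=0)
```

In this sketch, averaging the per-view outputs plays the role of the aggregation that makes the learned representation independent of any specific view.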
Related papers
- Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D [95.14469865815768]
2D vision models can be used for semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets.
However, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task.
In this paper, we propose Lift3D, which trains to predict unseen views on feature spaces generated by a few visual models.
We even outperform state-of-the-art methods specialized for the task in question.
arXiv Detail & Related papers (2024-03-27T18:13:16Z)
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding [37.47195477043883]
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents.
We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes.
We demonstrate that this scaling enables a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning.
arXiv Detail & Related papers (2024-01-17T17:04:35Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models at aligning the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence helps ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- SAT: 2D Semantics Assisted Training for 3D Visual Grounding [95.84637054325039]
3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.
Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images.
We propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning.
arXiv Detail & Related papers (2021-05-24T17:58:36Z)
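The SAT entry above describes using 2D image semantics only at training time to ease point-cloud-language representation learning. Purely as a hedged illustration of that training-only assistance pattern, and not SAT's actual objective, the sketch below adds an auxiliary contrastive loss that aligns each 3D object feature with the 2D feature of its matched image region and is dropped at inference; the function name, the InfoNCE-style form, and the temperature are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F


def aux_2d_alignment_loss(feats_3d, feats_2d, temperature=0.07):
    """Hypothetical training-only auxiliary loss: align each 3D object feature
    with the 2D feature of the same object so the point-cloud branch inherits
    2D semantics; at inference only the 3D branch is kept."""
    # feats_3d, feats_2d: (N, D) features of N matched object proposals.
    z3 = F.normalize(feats_3d, dim=-1)
    z2 = F.normalize(feats_2d, dim=-1)
    logits = z3 @ z2.t() / temperature                 # (N, N) similarity matrix
    targets = torch.arange(z3.size(0), device=z3.device)
    # Symmetric cross-entropy over the 3D->2D and 2D->3D directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```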
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.