3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
- URL: http://arxiv.org/abs/2308.04352v1
- Date: Tue, 8 Aug 2023 15:59:17 GMT
- Title: 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
- Authors: Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
- Abstract summary: 3D-VisTA is a pre-trained Transformer for 3D Vision and Text Alignment.
ScanScribe is the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training.
- Score: 44.00343134325925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D vision-language grounding (3D-VL) is an emerging field that aims to
connect the 3D physical world with natural language, which is crucial for
achieving embodied intelligence. Current 3D-VL models rely heavily on
sophisticated modules, auxiliary losses, and optimization tricks, which calls
for a simple and unified model. In this paper, we propose 3D-VisTA, a
pre-trained Transformer for 3D Vision and Text Alignment that can be easily
adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention
layers for both single-modal modeling and multi-modal fusion without any
sophisticated task-specific design. To further enhance its performance on 3D-VL
tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs
dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185
unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with
278K paired scene descriptions generated from existing 3D-VL tasks, templates,
and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object
modeling and scene-text matching. It achieves state-of-the-art results on
various 3D-VL tasks, ranging from visual grounding and dense captioning to
question answering and situated reasoning. Moreover, 3D-VisTA demonstrates
superior data efficiency, obtaining strong performance even with limited
annotations during downstream task fine-tuning.
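The abstract describes 3D-VisTA as plain self-attention throughout: single-modal encoders for text tokens and 3D object tokens, a self-attention fusion module over the joint sequence, and three pre-training objectives (masked language modeling, masked object modeling, and scene-text matching). The minimal PyTorch sketch below illustrates only that layout; the hidden size, depth, vocabulary, object label space, and the use of the first text token as a sequence summary are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a 3D-VisTA-style layout, based only on the abstract:
# self-attention encoders per modality, self-attention fusion over the joint
# sequence, and the three pre-training heads. All sizes are illustrative.
import torch
import torch.nn as nn

class VisTASketch(nn.Module):
    def __init__(self, vocab_size=30522, obj_feat_dim=768, obj_classes=256,
                 d_model=768, n_heads=12, n_single=4, n_fusion=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)  # BERT-sized vocab as a placeholder
        self.obj_proj = nn.Linear(obj_feat_dim, d_model)      # per-object point-cloud features -> tokens

        def encoder(depth):
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, depth)

        # Single-modal modeling, then multi-modal fusion, all via self-attention;
        # no task-specific modules.
        self.text_enc = encoder(n_single)
        self.obj_enc = encoder(n_single)
        self.fusion = encoder(n_fusion)

        # Pre-training heads.
        self.mlm_head = nn.Linear(d_model, vocab_size)   # masked language modeling
        self.mom_head = nn.Linear(d_model, obj_classes)  # masked object modeling (semantic class)
        self.stm_head = nn.Linear(d_model, 2)            # scene-text matching (match / no match)

    def forward(self, text_ids, obj_feats):
        t = self.text_enc(self.text_embed(text_ids))     # (B, Lt, D)
        o = self.obj_enc(self.obj_proj(obj_feats))       # (B, Lo, D)
        fused = self.fusion(torch.cat([t, o], dim=1))    # joint self-attention fusion
        t_out, o_out = fused[:, :text_ids.size(1)], fused[:, text_ids.size(1):]
        return {
            "mlm_logits": self.mlm_head(t_out),
            "mom_logits": self.mom_head(o_out),
            "stm_logits": self.stm_head(fused[:, 0]),    # first text token as a sequence summary (assumption)
        }

model = VisTASketch()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 32, 768))
print({k: tuple(v.shape) for k, v in out.items()})
```

In actual pre-training, each head would be paired with a cross-entropy loss over the masked text positions, the masked object slots, and matched versus mismatched scene-text pairs, respectively.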
Related papers
- 3D Vision and Language Pretraining with Large-Scale Synthetic Data [28.45763758308814]
3D Vision-Language Pre-training aims to provide a pre-trained model that can bridge 3D scenes with natural language.
We construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels.
We propose a synthetic-to-real domain adaptation in downstream task fine-tuning process to address the domain shift.
arXiv Detail & Related papers (2024-07-08T16:26:52Z)
- Unifying 3D Vision-Language Understanding via Promptable Queries [39.55438547712157]
PQ3D is a unified model for 3D vision-language (3D-VL) understanding.
PQ3D is capable of using Promptable Queries to tackle a wide range of 3D-VL tasks.
Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks.
arXiv Detail & Related papers (2024-05-19T04:35:05Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics of text and 2D images.
During inference, the learned text-3D correspondence helps ground text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been shown to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans [6.936271803454143]
We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG).
We created RIORefer, a large-scale 3D visual grounding dataset.
It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
arXiv Detail & Related papers (2023-05-23T09:52:49Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings (a minimal alignment sketch appears after this list).
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training (a minimal distillation sketch appears after this list).
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
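The Multi-CLIP and CLIP-Guided entries above describe the same basic idea: pull a 3D scene encoder's output toward frozen CLIP image and text embeddings of the corresponding views and descriptions. The sketch below shows one way such an alignment objective can be written; `Toy3DEncoder` and the cosine-distance loss are hypothetical stand-ins for illustration, not the authors' implementation (the papers use point-cloud backbones and their own contrastive formulations).

```python
# Hedged sketch of CLIP-guided alignment: pull a 3D scene encoder's output
# toward frozen CLIP image/text embeddings of the same scene. The 3D encoder
# below is a toy stand-in (mean-pooled MLP over per-point features).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy3DEncoder(nn.Module):
    def __init__(self, in_dim=6, embed_dim=512):  # xyz + rgb per point, CLIP-sized output
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, points):                    # points: (B, N, in_dim)
        return self.mlp(points).mean(dim=1)       # (B, embed_dim) scene embedding

def alignment_loss(scene_emb, clip_img_emb, clip_txt_emb):
    """Cosine-alignment objective: the 3D scene embedding should match the frozen
    CLIP embeddings of the corresponding observed image and description."""
    scene = F.normalize(scene_emb, dim=-1)
    img = F.normalize(clip_img_emb, dim=-1)
    txt = F.normalize(clip_txt_emb, dim=-1)
    return (1 - (scene * img).sum(-1)).mean() + (1 - (scene * txt).sum(-1)).mean()

encoder = Toy3DEncoder()
points = torch.randn(4, 1024, 6)   # a batch of toy point clouds
clip_img = torch.randn(4, 512)     # placeholders for frozen CLIP image outputs
clip_txt = torch.randn(4, 512)     # placeholders for frozen CLIP text outputs
loss = alignment_loss(encoder(points), clip_img, clip_txt)
loss.backward()
```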
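For the 3D-to-2D Distillation entry, the snippet below sketches only its first step under toy assumptions: a frozen, pretrained 3D network provides target features and a 2D branch learns to produce "simulated 3D" features from RGB input. The backbones are placeholders, and the dimension-normalization and adversarial components mentioned in the entry are omitted.

```python
# Rough sketch of 3D-to-2D feature distillation with toy stand-in networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_3d = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 64)).eval()  # frozen 3D net
student_2d = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(32, 64, 3, padding=1))                          # 2D image branch

points = torch.randn(2, 4096, 6)    # toy point cloud (xyz + rgb per point)
image = torch.randn(2, 3, 32, 32)   # toy RGB image of the same scene

with torch.no_grad():               # teacher is pretrained and frozen
    target = teacher_3d(points).mean(dim=1)        # (B, 64) pooled 3D features

simulated = student_2d(image).mean(dim=(2, 3))     # (B, 64) "simulated 3D" features from 2D input
distill_loss = F.mse_loss(simulated, target)       # supervise 2D features with 3D ones
distill_loss.backward()
```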
This list is automatically generated from the titles and abstracts of the papers in this site.