Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
- URL: http://arxiv.org/abs/2305.10714v1
- Date: Thu, 18 May 2023 05:25:40 GMT
- Title: Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
- Authors: Taolin Zhang, Sunan He, Tao Dai, Bin Chen, Zhi Wang, Shu-Tao Xia
- Abstract summary: A vision-language pre-training framework is proposed that transfers flexibly to 3D vision-language downstream tasks.
In this paper, we investigate three common tasks in semantic 3D scene understanding and derive key insights into the development of a pre-training model.
Experiments verify the excellent performance of the framework on three 3D vision-language tasks.
- Score: 47.48443919164377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, vision-language pre-training frameworks have made
significant progress in natural language processing and computer vision,
achieving remarkable performance improvement on various downstream tasks.
However, when extended to point cloud data, existing works mainly focus on
building task-specific models and fail to extract universal 3D vision-language
embeddings that generalize well. We carefully investigate three common tasks in
semantic 3D scene understanding, and derive key insights into the development
of a pre-training model. Motivated by these observations, we propose a
vision-language pre-training framework 3DVLP (3D vision-language pre-training
with object contrastive learning), which transfers flexibly to 3D
vision-language downstream tasks. 3DVLP takes visual grounding as the proxy
task and introduces an Object-level IoU-guided Detection (OID) loss to obtain
high-quality proposals in the scene. Moreover, we design an Object-level
Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive
learning (OSC) task to align objects with their descriptions and distinguish
different objects in the scene, respectively. Extensive experiments verify the
excellent performance of 3DVLP on three 3D vision-language tasks, reflecting
its superiority in semantic 3D scene understanding.
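The abstract names the three object-level objectives (OID, OCC, OSC) but does not spell out their formulations. As a rough illustration only, the sketch below shows how the two contrastive objectives could be written in PyTorch under the assumption of a standard InfoNCE-style setup: `occ_loss` pulls each object proposal toward its paired description embedding, while `osc_loss` pushes apart proposals belonging to different objects. The function names, tensor shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def occ_loss(object_feats, text_feats, temperature=0.07):
    """Sketch of an Object-level Cross-Contrastive (OCC) alignment loss.

    object_feats: (N, D) features of N object proposals.
    text_feats:   (N, D) embeddings of the descriptions paired with them.
    Matched (object_i, text_i) pairs are positives; every other object-text
    combination in the batch is treated as a negative (assumed InfoNCE form).
    """
    obj = F.normalize(object_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = obj @ txt.t() / temperature                # (N, N) similarity matrix
    targets = torch.arange(obj.size(0), device=obj.device)
    # Symmetric cross-entropy over object->text and text->object directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def osc_loss(object_feats, object_ids, temperature=0.07):
    """Sketch of an Object-level Self-Contrastive (OSC) loss.

    Proposals covering the same ground-truth object (same id in object_ids)
    are treated as positives of each other; proposals of different objects
    are pushed apart (a supervised-contrastive-style formulation, assumed).
    """
    z = F.normalize(object_feats, dim=-1)
    n = z.size(0)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (object_ids.unsqueeze(0) == object_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))     # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability of the positives for each anchor proposal.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()             # anchors with >= 1 positive
```

A joint pre-training objective could then combine these terms with the detection loss, e.g. `total = det_loss + occ_loss(...) + osc_loss(...)`, with the relative weights left as hyperparameters.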
Related papers
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners [15.178598145436142]
We propose the Language-Regularized Concept Learner (LARC).
LARC uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners.
We show that LARC improves the performance of prior works in naturally supervised 3D visual grounding.
arXiv Detail & Related papers (2024-04-30T16:44:18Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation [30.429893959096752]
We develop a novel training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation.
We construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset.
arXiv Detail & Related papers (2022-01-26T07:43:47Z)
- KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end QF framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)