Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
- URL: http://arxiv.org/abs/2305.10714v1
- Date: Thu, 18 May 2023 05:25:40 GMT
- Title: Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
- Authors: Taolin Zhang, Sunan He, Tao Dai, Bin Chen, Zhi Wang, Shu-Tao Xia
- Abstract summary: A vision-language pre-training framework is proposed that transfers flexibly to 3D vision-language downstream tasks.
In this paper, we investigate three common tasks in semantic 3D scene understanding and derive key insights into the development of a pre-training model.
Experiments verify the excellent performance of the framework on three 3D vision-language tasks.
- Score: 47.48443919164377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, vision-language pre-training frameworks have made
significant progress in natural language processing and computer vision,
achieving remarkable performance improvement on various downstream tasks.
However, when extended to point cloud data, existing works mainly focus on
building task-specific models and fail to extract universal 3D vision-language
embeddings that generalize well. We carefully investigate three common tasks in
semantic 3D scene understanding, and derive key insights into the development
of a pre-training model. Motivated by these observations, we propose a
vision-language pre-training framework 3DVLP (3D vision-language pre-training
with object contrastive learning), which transfers flexibly to 3D
vision-language downstream tasks. 3DVLP takes visual grounding as the proxy
task and introduces an Object-level IoU-guided Detection (OID) loss to obtain
high-quality proposals in the scene. Moreover, we design an Object-level
Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive
learning (OSC) task to align objects with their descriptions and distinguish
different objects in the scene, respectively. Extensive experiments verify the
excellent performance of 3DVLP on three 3D vision-language tasks, reflecting
its superiority in semantic 3D scene understanding.
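The abstract names the three object-level objectives (OID, OCC, OSC) but does not spell out their formulations. As a rough illustration only, the sketch below shows how the two contrastive objectives could be written in PyTorch under the assumption of a standard InfoNCE-style setup: `occ_loss` pulls each object proposal toward its paired description embedding, while `osc_loss` pushes apart proposals belonging to different objects. The function names, tensor shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def occ_loss(object_feats, text_feats, temperature=0.07):
    """Sketch of an Object-level Cross-Contrastive (OCC) alignment loss.

    object_feats: (N, D) features of N object proposals.
    text_feats:   (N, D) embeddings of the descriptions paired with them.
    Matched (object_i, text_i) pairs are positives; every other object-text
    combination in the batch is treated as a negative (assumed InfoNCE form).
    """
    obj = F.normalize(object_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = obj @ txt.t() / temperature                # (N, N) similarity matrix
    targets = torch.arange(obj.size(0), device=obj.device)
    # Symmetric cross-entropy over object->text and text->object directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def osc_loss(object_feats, object_ids, temperature=0.07):
    """Sketch of an Object-level Self-Contrastive (OSC) loss.

    Proposals covering the same ground-truth object (same id in object_ids)
    are treated as positives of each other; proposals of different objects
    are pushed apart (a supervised-contrastive-style formulation, assumed).
    """
    z = F.normalize(object_feats, dim=-1)
    n = z.size(0)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (object_ids.unsqueeze(0) == object_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))     # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability of the positives for each anchor proposal.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()             # anchors with >= 1 positive
```

A joint pre-training objective could then combine these terms with the detection loss, e.g. `total = det_loss + occ_loss(...) + osc_loss(...)`, with the relative weights left as hyperparameters.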
Related papers
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners [15.178598145436142]
We propose the Language-Regularized Concept Learner (LARC).
LARC uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners.
We show that LARC improves the performance of prior works in naturally supervised 3D visual grounding.
arXiv Detail & Related papers (2024-04-30T16:44:18Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation [30.429893959096752]
We develop a novel training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation.
We construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset.
arXiv Detail & Related papers (2022-01-26T07:43:47Z)
- KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end QF framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)