GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
- URL: http://arxiv.org/abs/2501.01428v3
- Date: Thu, 09 Jan 2025 16:41:07 GMT
- Title: GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
- Authors: Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, Hengshuang Zhao,
- Abstract summary: 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks.
Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results.
We propose exploring a purely vision-based solution inspired by human perception, which merely relies on visual cues for 3D spatial understanding.
- Score: 39.488763757826426
- Abstract: In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. However, we propose exploring a purely vision-based solution inspired by human perception, which relies only on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm in VLM training and inference that helps build the global-local relationship, significantly improving the 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a 3D Bird's Eye View (BEV) image from the video and marks consistent object IDs across both frames and the BEV image. The model then takes the concatenated BEV image and video frames with markers as input. In zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs like GPT-4o. Additionally, we prepare a processed video dataset consisting of 165K text annotations to fine-tune open-source VLMs, achieving state-of-the-art performance on all 3D understanding tasks. Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve during inference, even without visual prompting or the BEV image as explicit correspondence. This demonstrates that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, which paves the way for a noninvasive approach to extending pre-trained VLMs for 3D scene understanding.
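The marking step described in the abstract (consistent object IDs shared between a BEV image and the individual frames) can be illustrated with a small sketch. This is not the authors' implementation: it assumes 3D object centers are already available from some reconstruction step, uses a hypothetical translation-only pinhole camera that looks along the world +y axis, and the names `Marker`, `bev_markers`, and `frame_markers` are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    object_id: int  # the same ID is reused in the BEV image and in every frame
    u: int          # pixel column
    v: int          # pixel row


def bev_markers(centers, bev_size=256, margin=0.05):
    """Top-down orthographic projection: drop z and scale x/y into the image."""
    xs = [c[0] for c in centers.values()]
    ys = [c[1] for c in centers.values()]
    min_x, max_y = min(xs), max(ys)
    # Use one scale for both axes so the floor plan is not distorted.
    extent = max(max(xs) - min_x, max_y - min(ys)) or 1.0
    pad = margin * bev_size
    scale = (bev_size - 2 * pad) / extent
    markers = []
    for oid, (x, y, _z) in sorted(centers.items()):
        u = int(round(pad + (x - min_x) * scale))
        v = int(round(pad + (max_y - y) * scale))  # image rows grow downward
        markers.append(Marker(oid, u, v))
    return markers


def frame_markers(centers, cam_pos, focal, width, height):
    """Pinhole projection for an assumed camera looking along world +y, z up."""
    cx, cy, cz = cam_pos
    markers = []
    for oid, (x, y, z) in sorted(centers.items()):
        depth = y - cy
        if depth <= 0:
            continue  # object is behind the camera, so it gets no marker here
        u = int(round(focal * (x - cx) / depth + width / 2))
        v = int(round(-focal * (z - cz) / depth + height / 2))
        if 0 <= u < width and 0 <= v < height:
            markers.append(Marker(oid, u, v))
    return markers


if __name__ == "__main__":
    # Hypothetical object centers in world coordinates (metres).
    centers = {1: (0.0, 2.0, 0.0), 2: (1.0, 4.0, 0.0), 3: (0.0, -1.0, 0.0)}
    print(bev_markers(centers))
    print(frame_markers(centers, (0.0, 0.0, 0.0), 100.0, 200, 200))
```

Every object appears in the BEV markers, while a frame only carries markers for the objects it can see. The shared `object_id` is what ties a local view back to the global layout, which is the global-local correspondence the abstract refers to.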
Related papers
- 3D Scene Graph Guided Vision-Language Pre-training [11.131667398927394]
3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions.
Existing approaches typically follow task-specific, highly specialized paradigms.
This paper proposes a 3D scene graph-guided vision-language pre-training framework.
Date: 2024-11-27T16:10:44Z
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework that addresses 3D scene understanding.
We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
Date: 2024-03-18T14:47:03Z
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
Date: 2024-01-06T12:20:18Z
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
Date: 2023-12-15T09:08:14Z
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
Date: 2023-11-03T06:05:36Z
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
Date: 2023-08-01T07:50:14Z
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
Date: 2023-03-22T09:32:45Z
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
The recent breakthrough in 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
Date: 2022-11-29T15:52:22Z
This list is automatically generated from the titles and abstracts of the papers in this site.