Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
- URL: http://arxiv.org/abs/2409.03757v2
- Date: Sat, 23 Nov 2024 01:18:33 GMT
- Title: Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
- Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
- Abstract summary: We present a comprehensive study that probes various visual encoding models for 3D scene understanding.
Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models.
Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks.
- Abstract: Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D
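To make the probing setup described in the abstract concrete, the sketch below freezes an image-based foundation encoder (DINOv2 loaded via torch.hub) and trains only a lightweight head on its patch features. The linear head, 20-class label space, per-patch supervision, and input resolution are illustrative assumptions rather than the paper's exact evaluation pipeline; consult the released code for the actual protocol.

```python
# Minimal linear-probing sketch (assumptions noted above, not the paper's exact pipeline):
# freeze a 2D foundation encoder, extract patch features from scene frames, and train
# a lightweight head on a downstream label space.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen image-based foundation encoder (one of several possible encoder choices).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()
for p in encoder.parameters():
    p.requires_grad = False

NUM_CLASSES = 20   # assumed semantic label space for illustration
EMBED_DIM = 768    # ViT-B/14 feature width

# Lightweight trainable probe head; the frozen encoder does the heavy lifting.
probe = nn.Linear(EMBED_DIM, NUM_CLASSES).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(images: torch.Tensor, patch_labels: torch.Tensor) -> torch.Tensor:
    """One training step: images are (B, 3, 518, 518) scene frames,
    patch_labels are (B, num_patches) per-patch semantic labels."""
    with torch.no_grad():
        feats = encoder.forward_features(images.to(device))
        patch_tokens = feats["x_norm_patchtokens"]   # (B, num_patches, EMBED_DIM)
    logits = probe(patch_tokens)                     # (B, num_patches, NUM_CLASSES)
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), patch_labels.to(device).flatten()
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

The same frozen-encoder-plus-small-head pattern can be repeated for each encoder and task to compare feature quality under a fixed training budget.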
Related papers
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 is a more advanced multimodal foundation model for image and video understanding.
VideoLLaMA3 has four training stages: Vision Adaptation, Vision-Language Alignment, Fine-tuning, and Video-centric Fine-tuning.
VideoLLaMA3 achieves compelling performances in both image and video understanding benchmarks.
arXiv Detail & Related papers (2025-01-22T18:59:46Z)
- LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
LSceneLLM is an adaptive framework that automatically identifies task-relevant areas.
A dense token selector examines the LLM's attention map to identify visual preferences for the instruction input; a toy sketch of this selection step appears after the list below.
An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information.
arXiv Detail & Related papers (2024-12-02T09:07:57Z)
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
A vision-language pre-training framework is proposed that transfers flexibly to 3D vision-language downstream tasks.
In this paper, we investigate three common tasks in semantic 3D scene understanding, and derive key insights into a pre-training model.
Experiments verify the excellent performance of the framework on three 3D vision-language tasks.
arXiv Detail & Related papers (2023-05-18T05:25:40Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- 3D Neural Scene Representations for Visuomotor Control
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
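As referenced in the LSceneLLM entry above, the toy sketch below illustrates the general idea of dense token selection: rank coarse visual tokens by the attention mass that instruction tokens place on them and keep the top-k for fine-grained processing. The tensor shapes, the value of k, and the score aggregation are illustrative assumptions, not LSceneLLM's released implementation.

```python
# Hedged sketch of attention-guided visual token selection (assumptions noted above).
import torch

def select_dense_tokens(attn_map: torch.Tensor, visual_tokens: torch.Tensor, k: int = 32):
    """attn_map: (num_instruction_tokens, num_visual_tokens) LLM attention weights.
    visual_tokens: (num_visual_tokens, dim) coarse visual features."""
    # Aggregate the attention each visual token receives from all instruction tokens.
    scores = attn_map.sum(dim=0)                       # (num_visual_tokens,)
    topk = torch.topk(scores, k=min(k, scores.numel()))
    selected = visual_tokens[topk.indices]             # (k, dim) fine-grained candidates
    return selected, topk.indices

# Toy usage with random tensors.
attn = torch.rand(16, 256)     # 16 instruction tokens attending over 256 visual tokens
vis = torch.randn(256, 1024)
fine_tokens, idx = select_dense_tokens(attn, vis, k=32)
```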