MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
- URL: http://arxiv.org/abs/2506.01946v1
- Date: Mon, 02 Jun 2025 17:58:24 GMT
- Title: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
- Authors: Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han
- Abstract summary: 3DRS is a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding.
- Score: 14.083262551714133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs -- including visual grounding, captioning, and question answering -- demonstrate consistent performance gains. Project page: https://visual-ai.github.io/3drs
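To make the supervision idea concrete, here is a minimal PyTorch sketch of aligning MLLM visual features with features distilled from a pretrained 3D foundation model. The names (`mllm_feats`, `feats_3d`, `proj`) and the cosine-similarity alignment loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(mllm_feats: torch.Tensor,
                   feats_3d: torch.Tensor,
                   proj: torch.nn.Linear) -> torch.Tensor:
    """Align MLLM visual tokens with 3D foundation-model features.

    mllm_feats: (N, D_mllm) per-patch features from the MLLM's visual branch.
    feats_3d:   (N, D_3d) target features from a frozen pretrained 3D model,
                assumed to be matched patch-for-patch with mllm_feats.
    proj:       a learnable linear head mapping D_mllm -> D_3d.
    """
    pred = F.normalize(proj(mllm_feats), dim=-1)      # project into the 3D feature space
    target = F.normalize(feats_3d.detach(), dim=-1)   # the 3D teacher stays frozen
    # Maximize cosine similarity between student and teacher features.
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Toy usage with random tensors standing in for real features.
proj = torch.nn.Linear(1024, 768)
mllm_feats = torch.randn(196, 1024)   # e.g. 14x14 visual tokens
feats_3d = torch.randn(196, 768)      # matching targets from a 3D foundation model
loss = alignment_loss(mllm_feats, feats_3d, proj)
# Such an auxiliary loss would be added to the MLLM's standard language-modeling loss.
```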
Related papers
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.
Despite their architectural similarity to 2D VLMs, 3D scene-centric VLMs exhibit lower performance than the latest 3D object-centric and 2D image-based approaches.
Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors [23.66183317100899]
Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos.
We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM).
Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences.
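As a rough illustration of how 3D geometry priors extracted from video might be injected into an MLLM's visual stream, here is a hedged PyTorch sketch; the placeholder encoders, the fusion-by-addition scheme, and the tensor shapes are assumptions for illustration, not the VG LLM architecture itself.

```python
import torch
import torch.nn as nn

class GeometryFusedVisualEncoder(nn.Module):
    """Illustrative fusion of 2D visual tokens with 3D geometry tokens.

    `vision_encoder` stands in for the MLLM's 2D backbone and
    `geometry_encoder` for a 3D visual geometry encoder over video frames;
    both are placeholders, not real library components.
    """

    def __init__(self, vision_encoder: nn.Module, geometry_encoder: nn.Module,
                 dim: int = 1024):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.geometry_encoder = geometry_encoder
        self.geo_proj = nn.Linear(dim, dim)  # map geometry features into token space

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) video frames of the scene.
        vis_tokens = self.vision_encoder(frames)    # (T, N, dim) 2D patch tokens
        geo_tokens = self.geometry_encoder(frames)  # (T, N, dim) 3D prior per patch
        # Add projected geometry information to each visual token.
        return vis_tokens + self.geo_proj(geo_tokens)
```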
arXiv Detail & Related papers (2025-05-30T14:16:41Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning.
Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation.
We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
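One common way to transfer 2D MLLM predictions to a 3D scene is to back-project per-view 2D masks onto the point cloud as pseudo-labels. The sketch below shows that generic idea only; the simple pinhole camera model, the voting rule, and all names are assumptions, not MLLM-For3D's exact procedure.

```python
import torch

def lift_masks_to_points(points: torch.Tensor,      # (P, 3) scene point cloud
                         masks: torch.Tensor,       # (V, H, W) per-view 2D masks in {0, 1}
                         intrinsics: torch.Tensor,  # (V, 3, 3) camera intrinsics
                         extrinsics: torch.Tensor   # (V, 4, 4) world-to-camera transforms
                         ) -> torch.Tensor:
    """Aggregate per-view 2D masks into per-point pseudo-labels by voting."""
    P = points.shape[0]
    votes = torch.zeros(P)
    homog = torch.cat([points, torch.ones(P, 1)], dim=-1)       # (P, 4) homogeneous coords
    H, W = masks.shape[1:]
    for v in range(masks.shape[0]):
        cam = (extrinsics[v] @ homog.T).T[:, :3]                # points in camera frame
        in_front = cam[:, 2] > 0
        pix = (intrinsics[v] @ cam.T).T                         # pinhole projection
        pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)
        u = pix[:, 0].round().long()                            # pixel column
        r = pix[:, 1].round().long()                            # pixel row
        valid = in_front & (u >= 0) & (u < W) & (r >= 0) & (r < H)
        votes[valid] += masks[v, r[valid], u[valid]].float()
    return (votes >= 1).float()  # a point is positive if any view marks it
```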
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs.
We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z)
- Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [19.382210260928776]
Video-3D LLM treats 3D scenes as dynamic videos and incorporates 3D position encoding into these representations.
Our model achieves state-of-the-art performance on several 3D scene understanding benchmarks.
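The general idea of injecting 3D position information into per-frame visual tokens can be sketched as encoding each patch's back-projected 3D coordinate with a small MLP and adding it to the token. This is a minimal sketch under that assumption, not the Video-3D LLM design.

```python
import torch
import torch.nn as nn

class Position3DEncoding(nn.Module):
    """Add an embedding of each patch's 3D coordinate to its visual token."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Small MLP mapping (x, y, z) world coordinates to the token dimension.
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor, coords_3d: torch.Tensor) -> torch.Tensor:
        # tokens:    (T, N, dim) visual tokens per frame.
        # coords_3d: (T, N, 3) 3D points obtained by back-projecting each patch
        #            with depth and camera pose (assumed available).
        return tokens + self.mlp(coords_3d)

# Toy usage: 8 frames of 196 tokens with matching 3D coordinates.
enc = Position3DEncoding(dim=1024)
out = enc(torch.randn(8, 196, 1024), torch.randn(8, 196, 3))
```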
arXiv Detail & Related papers (2024-11-30T14:28:53Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capability but underperform on 3D tasks.
In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and the inability to handle camera focal variations.
We employ parameter-efficient fine-tuning of a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
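The snippet mentions parameter-efficient fine-tuning of a pre-trained MLLM; as one common instance of that idea, here is a minimal LoRA-style adapter in PyTorch. It is a generic sketch rather than LLMI3D's actual fine-tuning recipe, and the rank and scale values are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank trainable update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap one projection so only the low-rank adapter receives gradients.
frozen = nn.Linear(1024, 1024)
adapted = LoRALinear(frozen)
out = adapted(torch.randn(4, 1024))
```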
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without any 3D-specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
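To illustrate what extending the instruction prompt with 3D modality information can mean in practice, here is a hedged sketch that prepends projected 3D scene embeddings to the text token embeddings before the LLM; the projector, names, and shapes are assumptions, not 3DMIT's exact design.

```python
import torch
import torch.nn as nn

def build_multimodal_prompt(scene_feats: torch.Tensor,   # (K, D_scene) 3D scene features
                            text_embeds: torch.Tensor,   # (L, D_llm) embedded instruction tokens
                            projector: nn.Linear         # D_scene -> D_llm
                            ) -> torch.Tensor:
    """Prepend projected 3D scene tokens to the instruction embeddings."""
    scene_tokens = projector(scene_feats)                 # map 3D features into LLM space
    return torch.cat([scene_tokens, text_embeds], dim=0)  # (K + L, D_llm) prompt sequence

# Toy usage: 32 scene tokens plus a 20-token instruction.
projector = nn.Linear(768, 4096)
prompt = build_multimodal_prompt(torch.randn(32, 768), torch.randn(20, 4096), projector)
```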
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
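As a rough sketch of how a point cloud and its per-point features could become a token sequence for an LLM, the following pools points into a fixed number of tokens by simple random grouping and mean pooling; the pooling scheme and dimensions are illustrative assumptions rather than the 3D-LLM architecture.

```python
import torch
import torch.nn as nn

def pointcloud_to_tokens(xyz: torch.Tensor,    # (P, 3) point coordinates
                         feats: torch.Tensor,  # (P, D) per-point features from a 3D encoder
                         num_tokens: int,
                         proj: nn.Linear       # (3 + D) -> LLM hidden size
                         ) -> torch.Tensor:
    """Pool a featured point cloud into a fixed-length token sequence for an LLM."""
    P = xyz.shape[0]
    # Concatenate coordinates with features so positional information is preserved.
    x = torch.cat([xyz, feats], dim=-1)                     # (P, 3 + D)
    # Randomly group the points into num_tokens chunks and mean-pool each chunk.
    chunks = torch.chunk(x[torch.randperm(P)], num_tokens)  # crude grouping
    pooled = torch.stack([c.mean(dim=0) for c in chunks])   # (num_tokens, 3 + D)
    return proj(pooled)                                     # (num_tokens, D_llm)

# Toy usage: 8192 points with 256-dim features compressed to 64 LLM tokens.
proj = nn.Linear(3 + 256, 4096)
tokens = pointcloud_to_tokens(torch.randn(8192, 3), torch.randn(8192, 256), 64, proj)
```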
arXiv Detail & Related papers (2023-07-24T17:59:02Z)