Related papers: LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

URL: http://arxiv.org/abs/2511.16454v1
Date: Thu, 20 Nov 2025 15:22:22 GMT
Title: LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs
Authors: Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe,
Abstract summary: We introduce LLaVA$3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of vision-language models.<n>Inspired by Cubist painters, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object.
Score: 4.332158627306896
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

Related papers

Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.<n>Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited comparatively lower performance compared with the latest 3D object-centric and 2D image-based approaches.<n>Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness [73.72335146374543]
We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure.<n>Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks.
arXiv Detail & Related papers (2025-04-02T16:59:55Z)
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [19.382210260928776]
Video-3D LLM treats 3D scenes as dynamic videos and incorporates 3D position encoding into these representations.<n>Our model achieves state-of-the-art performance on several 3D scene understanding benchmarks.
arXiv Detail & Related papers (2024-11-30T14:28:53Z)
Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework addressing the 3D scene understanding. We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information. We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations. We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings. We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.