3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
- URL: http://arxiv.org/abs/2501.01163v2
- Date: Thu, 24 Apr 2025 06:33:35 GMT
- Title: 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
- Authors: Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid
- Abstract summary: We introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities.
- Score: 33.42183318484381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines, such as offline multi-view feature extraction or additional task-specific heads, 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks from text descriptions. This versatile OST is empowered by hybrid pretraining to obtain perception priors and is leveraged as the visual connector that bridges the 3D data to the LLM. After unified instruction tuning, 3D-LLaVA achieves impressive results on various benchmarks.
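To make the abstract's division of labor concrete, below is a minimal sketch of an OST-style visual connector in PyTorch. The class name, dimensions, top-k scoring rule, the choice of a 3D click as the visual prompt, and the 0.5 mask threshold are all illustrative assumptions, not details taken from the paper; the point is only the interface, in which one module selects visual tokens for the LLM, encodes interactive prompts into the same token space, and turns a text-side token back into a 3D mask.

```python
# Hypothetical sketch of the three OST roles described in the abstract.
import torch
import torch.nn as nn


class OmniSuperpointTransformerSketch(nn.Module):
    def __init__(self, feat_dim=256, llm_dim=4096, num_visual_tokens=256):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.score_head = nn.Linear(feat_dim, 1)       # ranks superpoint tokens
        self.to_llm = nn.Linear(feat_dim, llm_dim)     # projects into the LLM token space
        self.prompt_encoder = nn.Linear(3, llm_dim)    # e.g. a clicked 3D point -> one token
        self.mask_head = nn.Linear(llm_dim, feat_dim)  # text-side token -> mask query

    def select_visual_tokens(self, superpoint_feats):
        # (1) visual feature selector: keep the top-scoring superpoint tokens.
        scores = self.score_head(superpoint_feats).squeeze(-1)               # (N,)
        idx = scores.topk(min(self.num_visual_tokens, scores.numel())).indices
        return self.to_llm(superpoint_feats[idx]), idx                       # (K, llm_dim)

    def encode_visual_prompt(self, click_xyz):
        # (2) visual prompt encoder: map an interactive prompt (here a 3D click)
        # into the same token space as the visual tokens.
        return self.prompt_encoder(click_xyz)                                # (llm_dim,)

    def decode_referring_mask(self, referring_token, superpoint_feats):
        # (3) referring mask decoder: score every superpoint against a query
        # derived from a text-side token and threshold to get a 3D mask.
        query = self.mask_head(referring_token)                              # (feat_dim,)
        logits = superpoint_feats @ query                                    # (N,)
        return logits.sigmoid() > 0.5


# Toy usage with random superpoint features.
ost = OmniSuperpointTransformerSketch()
feats = torch.randn(1024, 256)                           # N superpoints x feat_dim
tokens, _ = ost.select_visual_tokens(feats)              # visual tokens fed to the LLM
prompt_token = ost.encode_visual_prompt(torch.tensor([1.0, 0.5, 0.2]))
mask = ost.decode_referring_mask(torch.randn(4096), feats)
print(tokens.shape, prompt_token.shape, mask.shape)
```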
Related papers
- Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning [18.185457833299235]
We propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously.
We first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features.
To obtain scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects.
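As a rough illustration of what "injecting multi-view 2D semantics into 3D geometric features" can look like, the sketch below uses a single cross-attention layer in which 3D instance features attend to multi-view image features. The class name, dimensions, and residual design are assumptions for illustration, not the Inst3D-LMM implementation.

```python
# Generic multi-view cross-modal fusion sketch (hypothetical, not Inst3D-LMM's MCMF).
import torch
import torch.nn as nn


class MultiViewFusionSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats_3d, feats_2d):
        # feats_3d: (B, N_instances, dim) geometric features
        # feats_2d: (B, N_views * N_patches, dim) multi-view image features
        fused, _ = self.attn(query=feats_3d, key=feats_2d, value=feats_2d)
        return self.norm(feats_3d + fused)  # residual fusion of 2D semantics into 3D


fusion = MultiViewFusionSketch()
out = fusion(torch.randn(2, 32, 256), torch.randn(2, 4 * 196, 256))
print(out.shape)  # torch.Size([2, 32, 256])
```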
arXiv Detail & Related papers (2025-03-01T14:38:42Z)
- 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs.
We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z)
- LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness [22.408933972095763]
Development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders.
We introduce a simple yet effective framework called LLaVA-3D, which efficiently adapts LLaVA for 3D scene understanding.
LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets.
arXiv Detail & Related papers (2024-09-26T17:59:11Z)
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration.
An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework addressing 3D scene understanding.
We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
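A generic sketch of this kind of active viewpoint-selection loop is given below; `render_view`, `vlm_score`, and the ranking rule are hypothetical placeholders standing in for a renderer and a VLM API, not Agent3D-Zero's actual interface.

```python
# Hypothetical active viewpoint selection: rank rendered views by a VLM score.
from typing import Callable, List, Tuple

Pose = Tuple[float, float, float]


def select_informative_views(
    candidate_poses: List[Pose],
    render_view: Callable[[Pose], bytes],        # placeholder: renders the scene from a pose
    vlm_score: Callable[[bytes, str], float],    # placeholder: VLM's informativeness score
    question: str,
    top_k: int = 3,
) -> List[Pose]:
    """Keep the top_k camera poses whose rendered views the VLM rates as
    most useful for answering `question`."""
    scored = []
    for pose in candidate_poses:
        image = render_view(pose)
        scored.append((vlm_score(image, question), pose))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [pose for _, pose in scored[:top_k]]
```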
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction [37.0434133128805]
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (MLLM) designed for embodied interaction.
ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++.
ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet.
arXiv Detail & Related papers (2024-02-27T18:57:12Z)
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning [42.61001274381612]
We present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts.
Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Captioning and 3D Question Answering.
arXiv Detail & Related papers (2023-11-30T16:00:23Z)
- Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following [88.39360296377589]
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video.
We also present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions.
arXiv Detail & Related papers (2023-09-01T17:59:47Z)
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)