LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
Understanding
- URL: http://arxiv.org/abs/2312.14074v1
- Date: Thu, 21 Dec 2023 17:52:12 GMT
- Title: LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
Understanding
- Authors: Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi
Li, Zehui Chen, Peng Gao, Yandong Guo and Shanghang Zhang
- Abstract summary: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding.
In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs.
The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem.
- Score: 36.66305190056456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Large Language Models (LLMs) and Multimodal Large Language Models
(MLLMs) have shown promise in instruction following and 2D image understanding.
While these models are powerful, they have not yet been developed to comprehend
the more challenging 3D physical scenes, especially when it comes to the sparse
outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw
LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs
to gain a comprehensive understanding of outdoor 3D scenes. The central insight
of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a
language modeling problem, encompassing tasks such as 3D captioning, 3D
grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D
LiDAR-text pairing data, we introduce a three-stage training strategy and
generate relevant datasets, progressively aligning the 3D modality with the
language embedding space of LLM. Furthermore, we design a View-Aware
Transformer (VAT) to connect the 3D encoder with the LLM, which effectively
bridges the modality gap and enhances the LLM's spatial orientation
comprehension of visual features. Our experiments show that LiDAR-LLM possesses
favorable capabilities to comprehend various instructions regarding 3D scenes
and engage in complex spatial reasoning. LiDAR-LLM attains a 40.9 BLEU-1 on the
3D captioning task and achieves a 63.1\% classification accuracy and a 14.3\%
BEV mIoU on the 3D grounding task. Web page:
https://sites.google.com/view/lidar-llm
Related papers
- 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs.
We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z) - LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models [62.85566496673856]
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model.
A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format.
arXiv Detail & Related papers (2024-11-14T17:08:23Z) - SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks.
We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z) - LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks.
In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations.
We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs)
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z) - 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z) - Chat-3D: Data-efficiently Tuning Large Language Model for Universal
Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z) - 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.