LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
Understanding
- URL: http://arxiv.org/abs/2312.14074v1
- Date: Thu, 21 Dec 2023 17:52:12 GMT
- Title: LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
Understanding
- Authors: Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi
Li, Zehui Chen, Peng Gao, Yandong Guo and Shanghang Zhang
- Abstract summary: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding.
In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs.
The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem.
- Score: 36.66305190056456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Large Language Models (LLMs) and Multimodal Large Language Models
(MLLMs) have shown promise in instruction following and 2D image understanding.
While these models are powerful, they have not yet been developed to comprehend
the more challenging 3D physical scenes, especially when it comes to the sparse
outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw
LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs
to gain a comprehensive understanding of outdoor 3D scenes. The central insight
of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a
language modeling problem, encompassing tasks such as 3D captioning, 3D
grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D
LiDAR-text pairing data, we introduce a three-stage training strategy and
generate relevant datasets, progressively aligning the 3D modality with the
language embedding space of LLM. Furthermore, we design a View-Aware
Transformer (VAT) to connect the 3D encoder with the LLM, which effectively
bridges the modality gap and enhances the LLM's spatial orientation
comprehension of visual features. Our experiments show that LiDAR-LLM possesses
favorable capabilities to comprehend various instructions regarding 3D scenes
and engage in complex spatial reasoning. LiDAR-LLM attains a 40.9 BLEU-1 on the
3D captioning task and achieves a 63.1% classification accuracy and a 14.3%
BEV mIoU on the 3D grounding task. Web page:
https://sites.google.com/view/lidar-llm
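The two mechanisms named in the abstract, staged 3D-language alignment and the View-Aware Transformer (VAT) bridging the 3D encoder and the LLM, can be pictured with a small sketch. Below is a minimal, hypothetical PyTorch rendering of a VAT-style bridge, assuming a Q-Former-like design in which learnable queries tagged with per-view position embeddings cross-attend to encoder features and are projected into the LLM embedding space; the module name, dimensions, and six-view split are illustrative assumptions, not the authors' released code.

# Hypothetical sketch of a View-Aware Transformer (VAT)-style bridge: learnable
# queries, tagged with per-view position embeddings, cross-attend to features
# from a frozen 3D LiDAR encoder and are projected into the LLM embedding space.
# All names, dimensions, and the six-view split are illustrative assumptions.
import torch
import torch.nn as nn

class ViewAwareBridge(nn.Module):
    def __init__(self, feat_dim=256, llm_dim=4096, num_queries=32, num_views=6):
        super().__init__()
        # Learnable query tokens that summarize the 3D scene for the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        # One embedding per view, injected so the queries stay orientation-aware.
        self.view_embed = nn.Embedding(num_views, feat_dim)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(feat_dim, 4 * feat_dim), nn.GELU(),
                                 nn.Linear(4 * feat_dim, feat_dim))
        self.norm1 = nn.LayerNorm(feat_dim)
        self.norm2 = nn.LayerNorm(feat_dim)
        # Linear projection into the language embedding space of the LLM.
        self.to_llm = nn.Linear(feat_dim, llm_dim)

    def forward(self, view_feats):
        # view_feats: (B, num_views, tokens_per_view, feat_dim) from the 3D encoder.
        B, V, T, D = view_feats.shape
        # Tag every encoder token with its view embedding, then flatten views.
        tagged = view_feats + self.view_embed.weight[None, :, None, :]
        kv = tagged.reshape(B, V * T, D)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        attn_out, _ = self.cross_attn(q, kv, kv)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return self.to_llm(q)  # (B, num_queries, llm_dim): soft prompts for the LLM

bridge = ViewAwareBridge()
tokens = bridge(torch.randn(2, 6, 100, 256))
print(tokens.shape)  # torch.Size([2, 32, 4096])

In a pipeline like the one the abstract describes, the returned tokens would be prepended to the text token embeddings, with the bridge trained stage by stage, first for 3D-language alignment, then for instruction following.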
Related papers
- SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks.
We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z)
- LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with logical reasoning, question answering, and handling open scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification [56.211321810408194]
Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass; a minimal patchification sketch appears after this list.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
arXiv Detail & Related papers (2024-06-08T18:17:09Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without any 3D-specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
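As referenced in the VP-LLM entry above, here is a minimal sketch of what volume patchification could look like: a dense voxel grid is cut into fixed-size 3D patches, each flattened and linearly embedded into one token, so an LLM can consume the volume as a sequence in a single forward pass. The module name, patch size, and dimensions are assumptions for illustration, not the paper's implementation.

# Hypothetical illustration of 3D volume patchification as summarized for
# VP-LLM: a dense voxel grid is cut into fixed-size 3D patches, and each patch
# is flattened and linearly embedded into one token. Patch size, dimensions,
# and names are assumptions, not the paper's code.
import torch
import torch.nn as nn

class VolumePatchify(nn.Module):
    def __init__(self, patch=8, embed_dim=512):
        super().__init__()
        # A non-overlapping 3D conv is equivalent to flatten-per-patch + linear.
        self.embed = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, volume):
        # volume: (B, 1, D, H, W) occupancy/SDF grid; D, H, W divisible by patch.
        x = self.embed(volume)               # (B, embed_dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim) tokens

tok = VolumePatchify()(torch.randn(2, 1, 32, 32, 32))
print(tok.shape)  # torch.Size([2, 64, 512]): 4*4*4 patches per volume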
This list is automatically generated from the titles and abstracts of the papers on this site.