VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification
- URL: http://arxiv.org/abs/2406.05543v1
- Date: Sat, 8 Jun 2024 18:17:09 GMT
- Title: VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification
- Authors: Jianmeng Liu, Yichen Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single-forward pass.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
- Score: 56.211321810408194
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent conditional 3D completion works have mainly relied on CLIP or BERT to encode textual information, which cannot support complex instructions. Meanwhile, large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. Inspired by recent advancements in LLMs, we present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass. To integrate a 3D model into the LLM tokenization configuration, the incomplete 3D object is first divided into small patches that can be encoded independently. These encoded patches are then fed into an LLM along with the text prompt, instructing the LLM to capture the relations between these patches as well as to inject semantic meaning into the 3D object. Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
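The patchification step described in the abstract can be illustrated with a minimal sketch: an incomplete occupancy volume is split into small cubic patches, each flattened into a vector so it can be encoded independently and passed to the LLM as a token sequence. The patch size, grid resolution, and function name below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def patchify(volume: np.ndarray, patch: int = 8) -> np.ndarray:
    """Split a (D, H, W) occupancy grid into (N, patch**3) flat patches.

    Hypothetical sketch of the patchification idea: each patch row could
    then be encoded independently (e.g. by a small learned encoder) before
    being fed to the LLM alongside the text prompt.
    """
    d, h, w = volume.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    # Break each spatial axis into (blocks, within-block) pairs.
    v = volume.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
    # Group the three block axes together, then the three within-patch axes.
    v = v.transpose(0, 2, 4, 1, 3, 5)
    # One row per patch: N = (d/patch) * (h/patch) * (w/patch) patches.
    return v.reshape(-1, patch ** 3)

vol = np.zeros((32, 32, 32), dtype=np.float32)
tokens = patchify(vol)
print(tokens.shape)  # (64, 512): 4*4*4 patches of 8^3 voxels each
```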
Related papers
- SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks.
We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z)
- More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding [22.753452376062565]
GreenPLM aims to enable robust 3D object understanding with minimal 3D point cloud and text data pairs.
Inspired by CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space.
We generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities.
arXiv Detail & Related papers (2024-08-28T17:38:44Z)
- Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model [52.27297680947337]
Multimodal language models (MLLMs) are increasingly being implemented in real-world environments.
Despite their potential, current top models within our community still fall short in adequately understanding spatial and temporal dimensions.
We introduce Coarse Correspondence, a training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
Effectively encoding and understanding videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding [36.66305190056456]
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding.
In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs.
The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem.
arXiv Detail & Related papers (2023-12-21T17:52:12Z)
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A.
It can produce high-quality results from low-quality point-text features while maintaining geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z)
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.