Related papers: VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification

URL: http://arxiv.org/abs/2406.05543v1
Date: Sat, 8 Jun 2024 18:17:09 GMT
Title: VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification
Authors: Jianmeng Liu, Yichen Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang
Abstract summary: Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass. Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
Score: 56.211321810408194
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent conditional 3D completion works have mainly relied on CLIP or BERT to encode textual information, which cannot support complex instructions. Meanwhile, large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. Inspired by recent advances in LLMs, we present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass. To integrate a 3D model into the LLM tokenization configuration, the incomplete 3D object is first divided into small patches that can be encoded independently. These encoded patches are then fed into an LLM along with the text prompt, instructing the LLM to capture the relations between these patches and to inject semantic meaning into the 3D object. Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
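
The patchification step described in the abstract can be illustrated with a minimal sketch. It assumes the 3D object is a cubic voxel occupancy grid stored as nested Python lists and that patches are non-overlapping cubes; the abstract does not state the patch size, the volume resolution, or the patch encoder, so the names `patchify` and `unpatchify` and the sizes below are hypothetical and not taken from the paper.

```python
from itertools import product


def patchify(volume, p):
    """Split a cubic voxel grid (side n) into non-overlapping p*p*p patches.

    Returns a list of (origin, flat_patch) pairs in raster order.
    Illustrative only: the paper's actual patch layout is not given here.
    """
    n = len(volume)
    if n % p != 0:
        raise ValueError("grid side must be divisible by the patch size")
    patches = []
    for x0, y0, z0 in product(range(0, n, p), repeat=3):
        # Flatten one patch so it could be passed to an independent encoder.
        flat = [volume[x][y][z]
                for x in range(x0, x0 + p)
                for y in range(y0, y0 + p)
                for z in range(z0, z0 + p)]
        patches.append(((x0, y0, z0), flat))
    return patches


def unpatchify(patches, n, p):
    """Inverse of patchify: write each flat patch back into an n*n*n grid."""
    volume = [[[0] * n for _ in range(n)] for _ in range(n)]
    for (x0, y0, z0), flat in patches:
        values = iter(flat)
        for x in range(x0, x0 + p):
            for y in range(y0, y0 + p):
                for z in range(z0, z0 + p):
                    volume[x][y][z] = next(values)
    return volume


# Toy 4x4x4 occupancy grid: a voxel is occupied when x + y + z is even.
grid = [[[int((x + y + z) % 2 == 0) for z in range(4)]
         for y in range(4)]
        for x in range(4)]

patches = patchify(grid, 2)          # 8 patches of 8 voxels each
restored = unpatchify(patches, 4, 2)  # round-trips to the original grid
```

In this sketch each flat patch would become one token-like unit for the language model, and the stored origin is what allows the completed patches to be placed back into the volume.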

Related papers

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh [79.20802127426003]
MeshLLM is a framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding.
arXiv Detail & Related papers (2025-08-02T07:37:37Z)
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs. We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z)
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models [62.85566496673856]
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format.
arXiv Detail & Related papers (2024-11-14T17:08:23Z)
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z)
When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation. How to effectively encode and understand videos in video-based dialogue systems remains to be solved. We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding [36.66305190056456]
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs. The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem.
arXiv Detail & Related papers (2023-12-21T17:52:12Z)
GPT4Point: A Unified Framework for Point-Language Understanding and Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework. As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. It can obtain high-quality results from low-quality point-text features while maintaining geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z)
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.