ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
- URL: http://arxiv.org/abs/2402.17766v3
- Date: Fri, 12 Jul 2024 15:36:15 GMT
- Title: ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
- Authors: Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
- Abstract summary: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction.
ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++.
ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet.
- Score: 37.0434133128805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
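As a rough illustration of the pipeline the abstract describes (a 3D point cloud encoder whose features are fed to an LLM for instruction following), the PyTorch sketch below projects point-cloud tokens into a language model's embedding space and prepends them to text embeddings. All module names, dimensions, and the toy backbone are illustrative assumptions, not the authors' actual ReCon++ or ShapeLLM implementation.

```python
# Minimal sketch of an encoder-projector-LLM layout, assuming a generic
# point-cloud encoder standing in for ReCon++. Shapes and hyperparameters
# are toy values, not those used in the paper.
import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Toy stand-in for ReCon++: groups points into patches and embeds them."""

    def __init__(self, num_patches=64, patch_dim=3 * 32, embed_dim=512):
        super().__init__()
        self.num_patches = num_patches
        self.patch_embed = nn.Sequential(
            nn.Linear(patch_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, points):  # points: (B, N, 3)
        B, N, _ = points.shape
        # Naive "patchify": split the N points into fixed-size groups.
        patches = points.reshape(B, self.num_patches, -1)  # (B, P, 3 * N / P)
        return self.patch_embed(patches)                   # (B, P, embed_dim)


class ShapeLLMStyleModel(nn.Module):
    """Projects 3D tokens into the LLM embedding space and concatenates them
    with text-instruction embeddings, mirroring the encoder-projector-LLM layout."""

    def __init__(self, encoder, encoder_dim=512, llm_dim=1024, vocab_size=32000):
        super().__init__()
        self.encoder = encoder
        self.projector = nn.Linear(encoder_dim, llm_dim)   # 3D tokens -> LLM space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Placeholder for a causal LM backbone; a real system would use a
        # pretrained decoder-only LLM here.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, points, input_ids):
        point_tokens = self.projector(self.encoder(points))    # (B, P, llm_dim)
        text_tokens = self.text_embed(input_ids)                # (B, T, llm_dim)
        fused = torch.cat([point_tokens, text_tokens], dim=1)   # (B, P + T, llm_dim)
        return self.llm(fused)


if __name__ == "__main__":
    model = ShapeLLMStyleModel(PointCloudEncoder())
    points = torch.randn(2, 2048, 3)               # batch of 2 point clouds
    input_ids = torch.randint(0, 32000, (2, 16))   # toy instruction tokens
    print(model(points, input_ids).shape)          # torch.Size([2, 80, 1024])
```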
Related papers
- 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs.
We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z) - 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer [33.42183318484381]
We introduce 3D-LLaVA, a simple yet powerful 3D LMM designed to act as an intelligent assistant for comprehending, reasoning about, and interacting with the 3D world.
At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities.
arXiv Detail & Related papers (2025-01-02T09:33:13Z) - Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without any 3D-specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z) - Unified Scene Representation and Reconstruction for 3D Large Language Models [40.693839066536505]
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.
We introduce Uni3DR2, which extracts 3D geometric and semantically aware representation features via frozen 2D foundation models.
Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs.
arXiv Detail & Related papers (2024-04-19T17:58:04Z) - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images together with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
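The language-3D contrastive learning mentioned here is, in general form, a CLIP-style alignment between paired embeddings. Below is a minimal, hedged sketch of one common formulation (symmetric InfoNCE over a fused point-cloud/image feature and a text feature); the additive fusion and the loss details are assumptions for illustration, not MixCon3D's exact recipe.

```python
# Generic CLIP-style contrastive alignment for paired 3D / image / text
# embeddings. The fusion of point-cloud and multi-view image features and the
# symmetric InfoNCE loss are illustrative assumptions.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def language_3d_loss(point_feat, image_feat, text_feat):
    """Fuse point-cloud and multi-view image features into one object
    embedding, then align it with the text embedding."""
    object_feat = point_feat + image_feat            # simple additive fusion (assumed)
    return info_nce(object_feat, text_feat)


if __name__ == "__main__":
    B, D = 8, 256
    loss = language_3d_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```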
arXiv Detail & Related papers (2023-11-03T06:05:36Z) - Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following [88.39360296377589]
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video.
We also present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions.
arXiv Detail & Related papers (2023-09-01T17:59:47Z) - 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z) - Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding [23.672405624011873]
We propose a module that consolidates the 3D visual stream with 2D clues synthesized from point clouds.
We empirically show that these synthesized clues boost the quality of the learned visual representations.
Our proposed module, dubbed Look Around and Refer (LAR), significantly outperforms state-of-the-art 3D visual grounding techniques on three benchmarks.
arXiv Detail & Related papers (2022-11-25T17:12:08Z) - Interactive Annotation of 3D Object Geometry using 2D Scribbles [84.51514043814066]
In this paper, we propose an interactive framework for annotating 3D object geometry from point cloud data and RGB imagery.
Our framework targets naive users without artistic or graphics expertise.
arXiv Detail & Related papers (2020-08-24T21:51:29Z)