Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
- URL: http://arxiv.org/abs/2511.13647v1
- Date: Mon, 17 Nov 2025 17:59:52 GMT
- Title: Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
- Authors: Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
- Abstract summary: Part-X-MLLM is a native 3D multimodal large language model. It unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar.
- Score: 35.75184591224847
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
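The abstract describes the model's output as a single token sequence that interleaves part-level bounding boxes, semantic descriptions, and edit commands, which a downstream geometry engine then executes. The paper's actual grammar is not given here, so the sketch below is purely illustrative: the command names (`BBOX`, `DESC`, `EDIT`), the `;`-separated flat syntax, and the `parse_plan` helper are all hypothetical assumptions, showing only how such a structured plan could be decoded into per-part records for a geometry backend.

```python
# Hypothetical sketch of a part-level "structured plan" interface.
# The real Part-X-MLLM grammar is not specified in the abstract; the
# command names (BBOX, DESC, EDIT) and layout here are assumptions.

def parse_plan(plan: str) -> dict:
    """Parse a flat ';'-separated command sequence into per-part records."""
    parts: dict = {}
    for stmt in filter(None, (s.strip() for s in plan.split(";"))):
        op, name, *args = stmt.split()
        rec = parts.setdefault(name, {})
        if op == "BBOX":      # six floats: axis-aligned box min/max corners
            rec["bbox"] = tuple(map(float, args))
        elif op == "DESC":    # free-text semantic description of the part
            rec["desc"] = " ".join(args)
        elif op == "EDIT":    # edit command handed to a geometry engine
            rec.setdefault("edits", []).append(tuple(args))
    return parts

plan = ("BBOX seat 0.1 0.0 0.1 0.9 0.4 0.9 ; "
        "DESC seat a flat wooden seat ; "
        "EDIT seat scale 1.2")
print(parse_plan(plan)["seat"]["bbox"])  # → (0.1, 0.0, 0.1, 0.9, 0.4, 0.9)
```

The point of such a design, as the abstract notes, is that the symbolic plan is engine-agnostic: any geometry module that understands the grammar can act as the backend behind the same language-native frontend.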
Related papers
- CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models [18.035268191933117]
CG-MLLM is a novel Large Language Model capable of 3D captioning and high-resolution 3D generation in a single framework. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks.
arXiv Detail & Related papers (2026-01-29T14:42:46Z) - PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding [67.15800065888887]
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. We introduce an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference, without any test-time multi-view rendering.
arXiv Detail & Related papers (2026-01-05T18:55:45Z) - PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data [47.60227259482637]
We present PartSAM, the first promptable part segmentation model trained on large-scale 3D data. PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens. To enable large-scale supervision, we introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs.
arXiv Detail & Related papers (2025-09-26T06:52:35Z) - MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds [50.98900790623827]
MeshCoder is a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach achieves superior performance in shape-to-code reconstruction tasks and also facilitates intuitive geometric and topological editing.
arXiv Detail & Related papers (2025-08-20T17:50:15Z) - MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh [79.20802127426003]
MeshLLM is a framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding.
arXiv Detail & Related papers (2025-08-02T07:37:37Z) - OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion [31.767548415448957]
We introduce OmniPart, a novel framework for part-aware 3D object generation. Our approach supports user-defined part granularity and precise localization, and enables diverse downstream applications.
arXiv Detail & Related papers (2025-07-08T16:46:15Z) - Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models [16.828694984680553]
Programmable-Room is a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each of a room's attributes, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP).
arXiv Detail & Related papers (2025-06-21T13:00:06Z) - Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z) - Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic- and geometric-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
arXiv Detail & Related papers (2024-04-02T17:59:10Z) - Locally Adaptive Neural 3D Morphable Models [38.38400553022714]
We present the Locally Adaptive Morphable Model (LAMM), a framework for learning to generate and manipulate 3D meshes.
A very efficient computational graph allows our network to train with only a fraction of the memory required by previous methods.
We further leverage local geometry control as a primitive for higher level editing operations and present a set of derivative capabilities.
arXiv Detail & Related papers (2024-01-05T18:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.