Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
- URL: http://arxiv.org/abs/2511.13647v1
- Date: Mon, 17 Nov 2025 17:59:52 GMT
- Title: Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
- Authors: Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
- Abstract summary: Part-X-MLLM is a native 3D multimodal large language model. It unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar.
- Score: 35.75184591224847
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
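The abstract describes the model's output as a single token sequence that interleaves part-level bounding boxes, semantic descriptions, and edit commands, which a downstream geometry engine then executes. The paper's actual grammar is not given here, so the sketch below is purely illustrative: the command names (`BBOX`, `DESC`, `EDIT`), the `;`-separated flat syntax, and the `parse_plan` helper are all hypothetical assumptions, showing only how such a structured plan could be decoded into per-part records for a geometry backend.

```python
# Hypothetical sketch of a part-level "structured plan" interface.
# The real Part-X-MLLM grammar is not specified in the abstract; the
# command names (BBOX, DESC, EDIT) and layout here are assumptions.

def parse_plan(plan: str) -> dict:
    """Parse a flat ';'-separated command sequence into per-part records."""
    parts: dict = {}
    for stmt in filter(None, (s.strip() for s in plan.split(";"))):
        op, name, *args = stmt.split()
        rec = parts.setdefault(name, {})
        if op == "BBOX":      # six floats: axis-aligned box min/max corners
            rec["bbox"] = tuple(map(float, args))
        elif op == "DESC":    # free-text semantic description of the part
            rec["desc"] = " ".join(args)
        elif op == "EDIT":    # edit command handed to a geometry engine
            rec.setdefault("edits", []).append(tuple(args))
    return parts

plan = ("BBOX seat 0.1 0.0 0.1 0.9 0.4 0.9 ; "
        "DESC seat a flat wooden seat ; "
        "EDIT seat scale 1.2")
print(parse_plan(plan)["seat"]["bbox"])  # → (0.1, 0.0, 0.1, 0.9, 0.4, 0.9)
```

The point of such a design, as the abstract notes, is that the symbolic plan is engine-agnostic: any geometry module that understands the grammar can act as the backend behind the same language-native frontend.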
Related papers
- CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models [18.035268191933117]
CG-MLLM is a novel Large Language Model capable of 3D captioning and high-resolution 3D generation in a single framework. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks.
arXiv Detail & Related papers (2026-01-29T14:42:46Z) - PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding [67.15800065888887]
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. We introduce an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference, without any test-time multi-view rendering.
arXiv Detail & Related papers (2026-01-05T18:55:45Z) - PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data [47.60227259482637]
We present PartSAM, the first promptable part segmentation model trained on large-scale 3D data. PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens. To enable large-scale supervision, we introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs.
arXiv Detail & Related papers (2025-09-26T06:52:35Z) - MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds [50.98900790623827]
MeshCoder is a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach achieves superior performance in shape-to-code reconstruction tasks and also facilitates intuitive geometric and topological editing.
arXiv Detail & Related papers (2025-08-20T17:50:15Z) - MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh [79.20802127426003]
MeshLLM is a framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding.
arXiv Detail & Related papers (2025-08-02T07:37:37Z) - OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion [31.767548415448957]
We introduce OmniPart, a novel framework for part-aware 3D object generation. Our approach supports user-defined part granularity and precise localization, and enables diverse downstream applications.
arXiv Detail & Related papers (2025-07-08T16:46:15Z) - Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models [16.828694984680553]
Programmable-Room is a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each of a room's attributes, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP).
arXiv Detail & Related papers (2025-06-21T13:00:06Z) - Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z) - Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic- and geometric-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
arXiv Detail & Related papers (2024-04-02T17:59:10Z) - Locally Adaptive Neural 3D Morphable Models [38.38400553022714]
We present the Locally Adaptive Morphable Model (LAMM), a framework for learning to generate and manipulate 3D meshes.
A very efficient computational graph allows our network to train with only a fraction of the memory required by previous methods.
We further leverage local geometry control as a primitive for higher level editing operations and present a set of derivative capabilities.
arXiv Detail & Related papers (2024-01-05T18:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.