MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation
- URL: http://arxiv.org/abs/2509.16768v1
- Date: Sat, 20 Sep 2025 18:25:14 GMT
- Title: MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation
- Authors: Omid Bonakdar, Nasser Mozayani,
- Abstract summary: We introduce MMPart, an innovative framework for generating part-aware 3D models from a single image.<n> MMPart generates isolated images of each object based on the initial image and the previous step's prompts as supervisor.<n>A reconstruction model converts each of these multi-view images into a 3D model.
- Score: 0.6445605125467574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative 3D modeling has advanced rapidly, driven by applications in VR/AR, metaverse, and robotics. However, most methods represent the target object as a closed mesh devoid of any structural information, limiting editing, animation, and semantic understanding. Part-aware 3D generation addresses this problem by decomposing objects into meaningful components, but existing pipelines face challenges: in existing methods, the user has no control over which objects are separated and how model imagine the occluded parts in isolation phase. In this paper, we introduce MMPart, an innovative framework for generating part-aware 3D models from a single image. We first use a VLM to generate a set of prompts based on the input image and user descriptions. In the next step, a generative model generates isolated images of each object based on the initial image and the previous step's prompts as supervisor (which control the pose and guide model how imagine previously occluded areas). Each of those images then enters the multi-view generation stage, where a number of consistent images from different views are generated. Finally, a reconstruction model converts each of these multi-view images into a 3D model.
Related papers
- DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion [50.90541069907167]
We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation.<n>Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency.
arXiv Detail & Related papers (2025-06-26T17:58:26Z) - CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation [58.46364872103992]
We introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model.<n>In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components.
arXiv Detail & Related papers (2025-05-11T14:54:26Z) - PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models [63.1432721793683]
We introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object.<n>We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin.
arXiv Detail & Related papers (2024-12-24T18:59:43Z) - Part123: Part-aware 3D Reconstruction from a Single-view Image [54.589723979757515]
Part123 is a novel framework for part-aware 3D reconstruction from a single-view image.
We introduce contrastive learning into a neural rendering framework to learn a part-aware feature space.
A clustering-based algorithm is also developed to automatically derive 3D part segmentation results from the reconstructed models.
arXiv Detail & Related papers (2024-05-27T07:10:21Z) - MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View [0.0]
This paper proposes a general framework to generate consistent multi-view images from single image or leveraging scene representation transformer and view-conditioned diffusion model.
Our model is able to generate 3D meshes surpassing baselines methods in evaluation metrics, including PSNR, SSIM and LPIPS.
arXiv Detail & Related papers (2024-05-06T22:55:53Z) - ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion [61.37481051263816]
Given a single image of a 3D object, this paper proposes a method (named ConsistNet) that is able to generate multiple images of the same object.
Our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-10-16T12:29:29Z) - Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D
Data [76.38261311948649]
Viewset Diffusion is a diffusion-based generator that outputs 3D objects while only using multi-view 2D data for supervision.
We train a diffusion model to generate viewsets, but design the neural network generator to reconstruct internally corresponding 3D models.
The model performs reconstruction efficiently, in a feed-forward manner, and is trained using only rendering losses using as few as three views per viewset.
arXiv Detail & Related papers (2023-06-13T16:18:51Z) - Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textural descriptions, utilize the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.