CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2601.21798v1
- Date: Thu, 29 Jan 2026 14:42:46 GMT
- Title: CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
- Authors: Junming Huang, Weiwei Xu
- Abstract summary: CG-MLLM is a novel Multi-modal Large Language Model capable of 3D captioning and high-resolution 3D generation in a single framework. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks.
- Score: 18.035268191933117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples these disparate modeling needs: the Token-level Autoregressive (TokenAR) Transformer handles token-level content, while the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.
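As a rough illustration of the abstract's token-level/block-level split, the PyTorch sketch below concatenates discrete text tokens and continuous 3D-VAE latent blocks into one causal sequence, then predicts the next token and the next block with separate heads. All module names, dimensions, and the single shared trunk are assumptions for illustration; the paper describes separate Mixture-of-Transformer expert stacks, not this simplification.

```python
# Hypothetical sketch of the TokenAR / BlockAR split described in the abstract.
# Names, sizes, and the single shared trunk are illustrative assumptions; the
# paper uses a Mixture-of-Transformer design with separate expert stacks.
import torch
import torch.nn as nn

class CGMLLMSketch(nn.Module):
    def __init__(self, vocab=32000, d=512, block_dim=64, layers=4, heads=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, d)    # discrete text tokens
        self.blk_embed = nn.Linear(block_dim, d)   # continuous 3D-VAE latent blocks
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, layers)  # shared long context
        self.tok_head = nn.Linear(d, vocab)        # TokenAR-style: next-token logits
        self.blk_head = nn.Linear(d, block_dim)    # BlockAR-style: next-block latent

    def forward(self, tokens, blocks):
        # Concatenating both modalities into one causal sequence lets text
        # tokens and spatial blocks attend to each other (the "long-context
        # interaction" the abstract mentions).
        x = torch.cat([self.tok_embed(tokens), self.blk_embed(blocks)], dim=1)
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.trunk(x, mask=causal)
        t = tokens.size(1)
        return self.tok_head(h[:, :t]), self.blk_head(h[:, t:])

model = CGMLLMSketch()
logits, blocks = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 8, 64))
print(logits.shape, blocks.shape)  # torch.Size([1, 16, 32000]) torch.Size([1, 8, 64])
```

In practice the blocks would be sampled autoregressively and decoded through the 3D VAE into geometry; this sketch only shows the shape of the interface between the two modalities.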
Related papers
- Exploring MLLM-Diffusion Information Transfer with MetaCanvas [66.28602082523464]
We propose a lightweight framework that lets MLLMs reason and plan directly in spatial and multimodal latent spaces. We evaluate it across six visual generation tasks, including text-to-image generation, text/image-to-video generation, image/video attribute editing, and in-context video generation.
arXiv Detail & Related papers (2025-12-12T11:07:11Z)
- S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance [20.55536735670125]
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. We propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning.
arXiv Detail & Related papers (2025-12-01T03:08:34Z)
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting [16.896443736904356]
Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions. We introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation. Our framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer.
arXiv Detail & Related papers (2025-10-18T08:53:08Z)
- Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy [4.1703677379815565]
We propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data. In our method, geometric priors are directly used to improve scene perception. Experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Captioning and 3D Visual Grounding tasks.
arXiv Detail & Related papers (2025-09-29T07:34:18Z)
- MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh [79.20802127426003]
MeshLLM is a framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding.
arXiv Detail & Related papers (2025-08-02T07:37:37Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback [18.857087708269038]
Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. SDS-based methods struggle to maintain semantic fidelity for user prompts. We propose Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). (The vanilla SDS gradient is recalled after this list for context.)
arXiv Detail & Related papers (2025-04-28T14:50:45Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [91.94869042117621]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models [62.85566496673856]
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model.
A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format (a toy mesh-to-text sketch follows this list).
arXiv Detail & Related papers (2024-11-14T17:08:23Z)
- VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification [56.211321810408194]
Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
arXiv Detail & Related papers (2024-06-08T18:17:09Z)
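For context on the CoherenDream entry above: Score Distillation Sampling, from the text-to-3D literature, optimizes a differentiable 3D representation $x = g(\theta)$ against a frozen text-conditioned diffusion model. The standard gradient is recalled below; the summary does not spell out TCSD's MLLM-feedback term, so only the vanilla objective is shown.

```latex
% Standard SDS gradient (text-to-3D literature): x = g(\theta) is a
% differentiable render, \hat{\epsilon}_\phi a frozen text-conditioned
% diffusion model, w(t) a weighting. TCSD's MLLM-feedback term is not
% specified in the summary above and is therefore omitted.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon .
```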
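The LLaMA-Mesh and MeshLLM entries both rest on serializing meshes as plain text so an LLM can read and emit them as ordinary tokens. The toy sketch below quantizes vertex coordinates to small integers and prints an OBJ-style string; the 64-bin quantization and exact layout are illustrative assumptions rather than either paper's scheme.

```python
# Toy sketch of mesh-as-text serialization in the spirit of LLaMA-Mesh /
# MeshLLM. The 64-bin coordinate quantization and OBJ-style layout are
# illustrative assumptions, not either paper's exact scheme.
def mesh_to_text(vertices, faces, bins=64):
    """vertices: (x, y, z) floats in [-1, 1]; faces: 1-based vertex index triples."""
    lines = []
    for x, y, z in vertices:
        # Quantize each coordinate to a small integer so it tokenizes compactly.
        q = [min(bins - 1, int((c + 1.0) / 2.0 * bins)) for c in (x, y, z)]
        lines.append("v {} {} {}".format(*q))
    for a, b, c in faces:
        lines.append(f"f {a} {b} {c}")
    return "\n".join(lines)

# A unit quad split into two triangles:
verts = [(-1, -1, 0), (1, -1, 0), (1, 1, 0), (-1, 1, 0)]
print(mesh_to_text(verts, [(1, 2, 3), (1, 3, 4)]))
# v 0 0 32
# v 63 0 32
# v 63 63 32
# v 0 63 32
# f 1 2 3
# f 1 3 4
```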