CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2601.21798v1
- Date: Thu, 29 Jan 2026 14:42:46 GMT
- Title: CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
- Authors: Junming Huang, Weiwei Xu
- Abstract summary: CG-MLLM is a novel Multi-modal Large Language Model capable of 3D captioning and high-resolution 3D generation in a single framework. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks.
- Score: 18.035268191933117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples these disparate modeling needs: the Token-level Autoregressive (TokenAR) Transformer handles token-level content, while the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.
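As a rough illustration of the abstract's token-level/block-level split, the PyTorch sketch below concatenates discrete text tokens and continuous 3D-VAE latent blocks into one causal sequence, then predicts the next token and the next block with separate heads. All module names, dimensions, and the single shared trunk are assumptions for illustration; the paper describes separate Mixture-of-Transformer expert stacks, not this simplification.

```python
# Hypothetical sketch of the TokenAR / BlockAR split described in the abstract.
# Names, sizes, and the single shared trunk are illustrative assumptions; the
# paper uses a Mixture-of-Transformer design with separate expert stacks.
import torch
import torch.nn as nn

class CGMLLMSketch(nn.Module):
    def __init__(self, vocab=32000, d=512, block_dim=64, layers=4, heads=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, d)    # discrete text tokens
        self.blk_embed = nn.Linear(block_dim, d)   # continuous 3D-VAE latent blocks
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, layers)  # shared long context
        self.tok_head = nn.Linear(d, vocab)        # TokenAR-style: next-token logits
        self.blk_head = nn.Linear(d, block_dim)    # BlockAR-style: next-block latent

    def forward(self, tokens, blocks):
        # Concatenating both modalities into one causal sequence lets text
        # tokens and spatial blocks attend to each other (the "long-context
        # interaction" the abstract mentions).
        x = torch.cat([self.tok_embed(tokens), self.blk_embed(blocks)], dim=1)
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.trunk(x, mask=causal)
        t = tokens.size(1)
        return self.tok_head(h[:, :t]), self.blk_head(h[:, t:])

model = CGMLLMSketch()
logits, blocks = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 8, 64))
print(logits.shape, blocks.shape)  # torch.Size([1, 16, 32000]) torch.Size([1, 8, 64])
```

In practice the blocks would be sampled autoregressively and decoded through the 3D VAE into geometry; this sketch only shows the shape of the interface between the two modalities.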
Related papers
- Exploring MLLM-Diffusion Information Transfer with MetaCanvas [66.28602082523464]
We propose a lightweight framework that lets MLLMs reason and plan directly in spatial and multimodal latent spaces. We evaluate it across six visual generation tasks, including text-to-image generation, text/image-to-video generation, image/video attribute editing, and in-context video generation.
arXiv Detail & Related papers (2025-12-12T11:07:11Z)
- S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance [20.55536735670125]
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. We propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning.
arXiv Detail & Related papers (2025-12-01T03:08:34Z)
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting [16.896443736904356]
Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions. We introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation. Our framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer.
arXiv Detail & Related papers (2025-10-18T08:53:08Z)
- Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy [4.1703677379815565]
We propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data. In our method, geometric priors are directly used to improve scene perception. Experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Captioning and 3D Visual Grounding tasks.
arXiv Detail & Related papers (2025-09-29T07:34:18Z)
- MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh [79.20802127426003]
MeshLLM is a framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding.
arXiv Detail & Related papers (2025-08-02T07:37:37Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback [18.857087708269038]
Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. SDS-based methods struggle to maintain semantic fidelity for user prompts. We propose Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). (The vanilla SDS gradient is recalled after this list for context.)
arXiv Detail & Related papers (2025-04-28T14:50:45Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [91.94869042117621]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models [62.85566496673856]
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model.
A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format (a toy mesh-to-text sketch follows this list).
arXiv Detail & Related papers (2024-11-14T17:08:23Z)
- VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification [56.211321810408194]
Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
arXiv Detail & Related papers (2024-06-08T18:17:09Z)
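For context on the CoherenDream entry above: Score Distillation Sampling, from the text-to-3D literature, optimizes a differentiable 3D representation $x = g(\theta)$ against a frozen text-conditioned diffusion model. The standard gradient is recalled below; the summary does not spell out TCSD's MLLM-feedback term, so only the vanilla objective is shown.

```latex
% Standard SDS gradient (text-to-3D literature): x = g(\theta) is a
% differentiable render, \hat{\epsilon}_\phi a frozen text-conditioned
% diffusion model, w(t) a weighting. TCSD's MLLM-feedback term is not
% specified in the summary above and is therefore omitted.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon .
```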
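The LLaMA-Mesh and MeshLLM entries both rest on serializing meshes as plain text so an LLM can read and emit them as ordinary tokens. The toy sketch below quantizes vertex coordinates to small integers and prints an OBJ-style string; the 64-bin quantization and exact layout are illustrative assumptions rather than either paper's scheme.

```python
# Toy sketch of mesh-as-text serialization in the spirit of LLaMA-Mesh /
# MeshLLM. The 64-bin coordinate quantization and OBJ-style layout are
# illustrative assumptions, not either paper's exact scheme.
def mesh_to_text(vertices, faces, bins=64):
    """vertices: (x, y, z) floats in [-1, 1]; faces: 1-based vertex index triples."""
    lines = []
    for x, y, z in vertices:
        # Quantize each coordinate to a small integer so it tokenizes compactly.
        q = [min(bins - 1, int((c + 1.0) / 2.0 * bins)) for c in (x, y, z)]
        lines.append("v {} {} {}".format(*q))
    for a, b, c in faces:
        lines.append(f"f {a} {b} {c}")
    return "\n".join(lines)

# A unit quad split into two triangles:
verts = [(-1, -1, 0), (1, -1, 0), (1, 1, 0), (-1, 1, 0)]
print(mesh_to_text(verts, [(1, 2, 3), (1, 3, 4)]))
# v 0 0 32
# v 63 0 32
# v 63 63 32
# v 0 63 32
# f 1 2 3
# f 1 3 4
```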