Related papers: Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

URL: http://arxiv.org/abs/2404.04363v2
Date: Wed, 18 Dec 2024 08:30:59 GMT
Title: Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs
Authors: Junhao Chen, Xiang Li, Xiaojun Ye, Chao Li, Zhaoxin Fan, Hao Zhao
Abstract summary: We argue that current 3D AIGC methods do not fully unleash human creativity. In this paper, we explore a novel 3D AIGC approach: generating 3D content from IDEAs. We propose the new framework Idea23D, which combines three agents based on large multimodal models (LMMs) and existing algorithmic tools.
Score: 13.360196679265226
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: With the success of 2D diffusion models, 2D AIGC content has already transformed our lives. Recently, this success has been extended to 3D AIGC, with state-of-the-art methods generating textured 3D models from single images or text. However, we argue that current 3D AIGC methods still do not fully unleash human creativity. We often imagine 3D content made from multimodal inputs, such as what it would look like if my pet bunny were eating a doughnut on the table. In this paper, we explore a novel 3D AIGC approach: generating 3D content from IDEAs. An IDEA is a multimodal input composed of text, image, and 3D models. To our knowledge, this challenging and exciting 3D AIGC setting has not been studied before. We propose the new framework Idea23D, which combines three agents based on large multimodal models (LMMs) and existing algorithmic tools. These three LMM-based agents are tasked with prompt generation, model selection, and feedback reflection. They collaborate and critique each other in a fully automated loop, without human intervention. The framework then generates a text prompt to create 3D models that align closely with the input IDEAs. We demonstrate impressive 3D AIGC results that surpass previous methods. To comprehensively assess the 3D AIGC capabilities of Idea23D, we introduce the Eval3DAIGC-198 dataset, containing 198 multimodal inputs for 3D generation tasks. This dataset evaluates the alignment between generated 3D content and input IDEAs. Our user study and quantitative results show that Idea23D significantly improves the success rate and accuracy of 3D generation, with excellent compatibility across various LMM, Text-to-Image, and Image-to-3D models. Code and dataset are available at https://idea23d.github.io/.
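
The abstract describes three LMM-based agents (prompt generation, model selection, feedback reflection) that critique each other in an automated loop. A minimal sketch of that control flow follows. It is not the authors' implementation: every name here (Idea, run_idea_loop, the stub_* functions, max_rounds) is hypothetical, and plain strings stand in for the LMM calls and the generated assets.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Idea:
    """Hypothetical container for an interleaved multimodal input."""
    text: str
    image_paths: List[str] = field(default_factory=list)
    model_paths: List[str] = field(default_factory=list)


def run_idea_loop(
    idea: Idea,
    generate_prompt: Callable[[Idea, str], str],
    generate_candidates: Callable[[str], List[str]],
    select_best: Callable[[Idea, List[str]], str],
    reflect: Callable[[Idea, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, str, int]:
    """Run prompt generation, selection, and reflection until accepted.

    Returns the last prompt, the last selected candidate, and the number
    of rounds used. max_rounds is an assumed stopping rule.
    """
    feedback = ""
    prompt, best, rounds = "", "", 0
    for rounds in range(1, max_rounds + 1):
        prompt = generate_prompt(idea, feedback)      # agent 1: write a prompt
        candidates = generate_candidates(prompt)      # external generation tools
        best = select_best(idea, candidates)          # agent 2: pick a candidate
        accepted, feedback = reflect(idea, best)      # agent 3: critique it
        if accepted:
            break
    return prompt, best, rounds


# Deterministic stand-ins so the loop can be run without any model.
def stub_generate_prompt(idea: Idea, feedback: str) -> str:
    draft = "a bunny on a table"  # deliberately drops a detail on round 1
    return draft if not feedback else f"{draft}, {feedback}"


def stub_generate_candidates(prompt: str) -> List[str]:
    return [f"{prompt} [candidate {i}]" for i in range(2)]


def stub_select_best(idea: Idea, candidates: List[str]) -> str:
    return candidates[0]


def stub_reflect(idea: Idea, candidate: str) -> Tuple[bool, str]:
    missing = [w for w in ("doughnut",) if w in idea.text and w not in candidate]
    if missing:
        return False, "include " + ", ".join(missing)
    return True, ""


if __name__ == "__main__":
    idea = Idea(text="my pet bunny eating a doughnut on the table")
    print(run_idea_loop(idea, stub_generate_prompt, stub_generate_candidates,
                        stub_select_best, stub_reflect))
```

With the stubs above, the first round's prompt omits the doughnut, the reflection stub reports it as missing, and the second round's prompt includes it, so the loop stops after two rounds. In a real system each callable would wrap an LMM or a generation model.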

Related papers

Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion [31.888133775976414]
We tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. Most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources. We propose category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models.
arXiv Detail & Related papers (2025-09-02T14:19:21Z)
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding [16.95099884066268]
ShapeLLM-Omni is a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI.
arXiv Detail & Related papers (2025-06-02T16:40:50Z)
Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning. UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE [28.597376637565123]
This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality.
arXiv Detail & Related papers (2024-11-25T19:00:05Z)
Any-to-3D Generation via Hybrid Diffusion Supervision [67.54197818071464]
XBind is a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality.
arXiv Detail & Related papers (2024-11-22T03:52:37Z)
EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration. An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data. We design an autoregressive generation that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z)
Retrieval-Augmented Score Distillation for Text-to-3D Generation [30.57225047257049]
We introduce a novel framework for retrieval-based quality enhancement in text-to-3D generation. We conduct extensive experiments to demonstrate that ReDream exhibits superior quality with increased geometric consistency.
arXiv Detail & Related papers (2024-02-05T12:50:30Z)
Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human [51.58094069317723]
This paper aims to provide a comprehensive overview and summary of the relevant papers published mostly during the latter half of 2023. It will begin by discussing AI-generated 3D object models, followed by generated 3D human models, and finally generated 3D human motions, culminating in a conclusive summary and a vision for the future.
arXiv Detail & Related papers (2024-01-05T03:41:38Z)
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [52.44678180286886]
Distilling 2D diffusion models achieves excellent generalization and rich details without any 3D data. We propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space. We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision. We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
A Convolutional Architecture for 3D Model Embedding [1.3858051019755282]
We propose a deep learning architecture to handle 3D models as an input. We show that the embedding representation conveys semantic information that helps to deal with the similarity assessment of 3D objects.
arXiv Detail & Related papers (2021-03-05T15:46:47Z)
Interactive Annotation of 3D Object Geometry using 2D Scribbles [84.51514043814066]
In this paper, we propose an interactive framework for annotating 3D object geometry from point cloud data and RGB imagery. Our framework targets naive users without artistic or graphics expertise.
arXiv Detail & Related papers (2020-08-24T21:51:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.