Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
- URL: http://arxiv.org/abs/2503.16278v3
- Date: Thu, 09 Oct 2025 02:59:45 GMT
- Title: Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
- Authors: Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Yitao Liang, Weinan E, Linfeng Zhang, Guolin Ke
- Abstract summary: We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy.
- Score: 32.45851798752336
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: 3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster.
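To make the core tokenization idea concrete, here is a minimal sketch (not the authors' code) of flattening a cubic occupancy grid into a coarse-to-fine octree token sequence; the three-symbol vocabulary (empty / full / split) and the breadth-first traversal are illustrative assumptions.

```python
# Minimal sketch of coarse-to-fine octree tokenization (illustrative only;
# the token vocabulary and traversal order are assumptions, not Uni-3DAR's code).
import numpy as np
from collections import deque

def octree_tokens(occ: np.ndarray, min_size: int = 1) -> list[int]:
    """Flatten a cubic binary occupancy grid into a 1D token sequence.

    Token 0 = empty leaf, 1 = full leaf, 2 = "split" (descend into 8 children).
    Breadth-first order yields the coarse-to-fine layout the abstract describes.
    """
    assert occ.shape[0] == occ.shape[1] == occ.shape[2]
    tokens = []
    queue = deque([(0, 0, 0, occ.shape[0])])  # (x, y, z, size) of current cube
    while queue:
        x, y, z, s = queue.popleft()
        cube = occ[x:x + s, y:y + s, z:z + s]
        if not cube.any():
            tokens.append(0)                  # uniformly empty: prune subtree
        elif cube.all() or s <= min_size:
            tokens.append(1)                  # fully occupied, or leaf reached
        else:
            tokens.append(2)                  # mixed: emit split, enqueue children
            h = s // 2
            for dx in (0, h):
                for dy in (0, h):
                    for dz in (0, h):
                        queue.append((x + dx, y + dy, z + dz, h))
    return tokens

grid = np.zeros((8, 8, 8), dtype=bool)
grid[:4, :4, :4] = True                       # one occupied octant
print(octree_tokens(grid))                    # 9 tokens instead of 8**3 voxels
```

The paper's two-level subtree compression packs groups of such tokens to shorten the sequence by up to 8x, and the masked next-token prediction objective recovers the position information that this packing makes dynamic.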
Related papers
- Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z)
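As a rough illustration of the patchification step Lemon's summary describes, the sketch below groups a raw point cloud into fixed-size local patches that could then be embedded as tokens; the farthest-point-sampling plus k-nearest-neighbor recipe is a common convention assumed here, not Lemon's published scheme.

```python
# Illustrative point-cloud patchification (a common FPS + kNN recipe; Lemon's
# actual "structured patchification" scheme may differ).
import numpy as np

def patchify(points: np.ndarray, num_patches: int = 64, patch_size: int = 32):
    """Group (N, 3) points into (num_patches, patch_size, 3) local patches."""
    # Farthest point sampling for well-spread patch centers.
    centers = [0]
    dist = np.full(len(points), np.inf)
    for _ in range(num_patches - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[centers[-1]], axis=1))
        centers.append(int(dist.argmax()))
    center_xyz = points[centers]
    # The k nearest neighbors of each center form one patch token.
    d2 = ((points[None, :, :] - center_xyz[:, None, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :patch_size]
    patches = points[idx] - center_xyz[:, None, :]   # center-relative coords
    return patches, center_xyz

pts = np.random.rand(2048, 3)
patches, centers = patchify(pts)
print(patches.shape, centers.shape)  # (64, 32, 3) (64, 3)
```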
- LATTICE: Democratize High-Fidelity 3D Generation at Scale [27.310104395842075]
LATTICE is a new framework for high-fidelity 3D asset generation. VoxSet is a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid. Our method is simple at its core, but supports arbitrary-resolution decoding, low-cost training, and flexible inference schemes.
arXiv Detail & Related papers (2025-11-24T03:22:19Z)
- Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification [59.17489431187807]
We propose a framework that enhances 3D geometric fidelity by leveraging CLIP's hierarchical spatial semantics. Our method significantly improves 3D few-shot class-incremental learning, achieving superior geometric coherence and robustness to texture bias.
arXiv Detail & Related papers (2025-09-18T13:45:08Z)
- ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding [16.95099884066268]
ShapeLLM-Omni is a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. Building upon 3D-aware discrete tokens, we construct a large-scale continuous training dataset named 3D-Alpaca. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI.
arXiv Detail & Related papers (2025-06-02T16:40:50Z)
- AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning [27.40106634796608]
Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning. Currently, 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies. We propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens.
arXiv Detail & Related papers (2025-05-19T07:11:07Z)
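The abstract describes pruning redundant spatial tokens; the sketch below shows the generic score-and-keep pattern such a framework builds on. The L2-norm importance proxy and the fixed keep ratio are assumptions for illustration, whereas AdaToken-3D learns its gating dynamically.

```python
# Generic score-based spatial-token pruning (illustrative; AdaToken-3D's
# gating network and pruning schedule are not specified in the abstract).
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the highest-importance fraction of spatial tokens.

    tokens: (batch, num_tokens, dim). Importance here is a simple L2-norm
    proxy; a learned gate would predict these scores instead.
    """
    scores = tokens.norm(dim=-1)                             # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    keep = scores.topk(k, dim=1).indices.sort(dim=1).values  # preserve order
    batch_idx = torch.arange(tokens.shape[0]).unsqueeze(1)
    return tokens[batch_idx, keep]                           # (B, k, dim)

x = torch.randn(2, 1024, 256)
print(prune_tokens(x, 0.25).shape)  # torch.Size([2, 256, 256])
```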
- OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation [24.980804600194062]
OctGPT is a novel multiscale autoregressive model for 3D shape generation.
It dramatically improves the efficiency and performance of prior 3D autoregressive approaches.
It offers a new paradigm for high-quality, scalable 3D content creation.
arXiv Detail & Related papers (2025-04-14T08:31:26Z)
- SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets [72.26350984924129]
We propose a latent space generation paradigm for 3D human digitization. We transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift. We employ a multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset.
arXiv Detail & Related papers (2025-04-09T15:38:18Z)
- HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation [61.32714172038278]
We propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate 2D dense poses from the original sparse 2D pose.
Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections.
With the generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on single-frame-based 3D human pose estimation.
arXiv Detail & Related papers (2025-03-30T06:15:36Z)
- DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning [21.77406648840365]
DeepMesh is a framework that optimizes mesh generation through two key innovations. It incorporates a novel tokenization algorithm, along with improvements in data curation and processing. It generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality.
arXiv Detail & Related papers (2025-03-19T14:39:30Z)
- ArchComplete: Autoregressive 3D Architectural Design Generation with Hierarchical Diffusion-Based Upsampling [0.0]
ArchComplete is a two-stage voxel-based 3D generative pipeline consisting of a vector-quantised model. Key to our pipeline is (i) learning a contextually rich codebook of local patch embeddings, optimised alongside a 2.5D perceptual loss. ArchComplete autoregressively generates models at a resolution of $64^3$ and progressively refines them up to $512^3$, with voxel sizes as small as $\approx 9\,\text{cm}$.
arXiv Detail & Related papers (2024-12-23T20:13:27Z)
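The $64^3$ to $512^3$ progression in the summary implies repeated 2x upsampling of a voxel grid; the sketch below shows only that resolution schedule, with plain trilinear interpolation standing in for the paper's hierarchical diffusion-based upsampler.

```python
# Resolution schedule implied by the abstract: coarse 64^3 generation followed
# by progressive refinement to 512^3. The upsampler here is a stand-in; the
# paper uses hierarchical diffusion-based upsampling at each step.
import torch
import torch.nn.functional as F

def progressive_refine(coarse: torch.Tensor, target: int = 512) -> torch.Tensor:
    """coarse: (B, C, 64, 64, 64) voxel grid -> (B, C, target, target, target)."""
    x = coarse
    while x.shape[-1] < target:
        # Double the resolution; a diffusion upsampler would denoise here.
        x = F.interpolate(x, scale_factor=2, mode="trilinear", align_corners=False)
    return x

v = torch.randn(1, 1, 64, 64, 64)
print(progressive_refine(v).shape)  # torch.Size([1, 1, 512, 512, 512])
```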
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.
GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- SC-Diff: 3D Shape Completion with Latent Diffusion Models [4.261508855254493]
We present a novel 3D shape completion framework that unifies multimodal conditioning. Shapes are represented as Truncated Signed Distance Functions (TSDFs) and encoded into a discrete latent space jointly supervised by 2D and 3D cues. Our approach guides the generation process with flexible multimodal conditioning, ensuring consistent integration of 2D and 3D information.
arXiv Detail & Related papers (2024-03-19T06:01:11Z)
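Since the summary hinges on Truncated Signed Distance Functions, here is a self-contained example of what a TSDF grid is, computed for an analytic sphere; real pipelines derive the distances from meshes or fused depth maps, but the truncation step is the same.

```python
# Truncated Signed Distance Function (TSDF) of an analytic sphere on a voxel
# grid. This illustrates the representation SC-Diff encodes, not its pipeline.
import numpy as np

def sphere_tsdf(res: int = 64, radius: float = 0.5, trunc: float = 0.1) -> np.ndarray:
    # Voxel-center coordinates spanning [-1, 1]^3.
    lin = np.linspace(-1.0, 1.0, res)
    x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
    sdf = np.sqrt(x**2 + y**2 + z**2) - radius   # signed distance to the surface
    return np.clip(sdf, -trunc, trunc)           # truncation bounds the range

tsdf = sphere_tsdf()
print(tsdf.shape, tsdf.min(), tsdf.max())        # (64, 64, 64) -0.1 0.1
```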
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains and pursue stronger 3D shape generation by improving their capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
- Spice-E: Structural Priors in 3D Diffusion using Cross-Entity Attention [9.52027244702166]
Spice-E is a neural network that adds structural guidance to 3D diffusion models.
We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D.
arXiv Detail & Related papers (2023-11-29T17:36:49Z)
- Instant3D: Instant Text-to-3D Generation [101.25562463919795]
We propose a novel framework for fast text-to-3D generation, dubbed Instant3D.
Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network.
arXiv Detail & Related papers (2023-11-14T18:59:59Z)
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving [74.34701012543968]
We present UniPAD, a novel self-supervised learning paradigm applying 3D differentiable rendering.
UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures.
Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively.
arXiv Detail & Related papers (2023-10-12T14:39:58Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Learning Versatile 3D Shape Generation with Improved AR Models [91.87115744375052]
Auto-regressive (AR) models have achieved impressive results in 2D image generation by modeling joint distributions in the grid space.
We propose the Improved Auto-regressive Model (ImAM) for 3D shape generation, which applies discrete representation learning based on a latent vector instead of volumetric grids.
arXiv Detail & Related papers (2023-03-26T12:03:18Z)
- Hierarchical Graph Networks for 3D Human Pose Estimation [50.600944798627786]
Recent 2D-to-3D human pose estimation works tend to utilize the graph structure formed by the topology of the human skeleton.
We argue that this skeletal topology is too sparse to reflect the body structure and suffers from a serious 2D-to-3D ambiguity problem.
We propose a novel graph convolution network architecture, Hierarchical Graph Networks, to overcome these weaknesses.
arXiv Detail & Related papers (2021-11-23T15:09:03Z)
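To see why the paper calls the raw skeletal topology too sparse, the sketch below runs one normalized graph-convolution hop over a 17-joint skeleton: each joint mixes features only with its few immediate neighbors, so long-range body context requires many layers or the hierarchical pooling the paper proposes. The edge list follows a common 17-keypoint convention and is assumed here for illustration.

```python
# One plain graph-convolution step over a 17-joint human skeleton, showing the
# sparse topology the paper argues against: few edges per joint means slow
# information propagation without hierarchical grouping.
import numpy as np

EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8),
         (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (8, 14), (14, 15), (15, 16)]

def gcn_layer(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """x: (17, d_in) joint features, w: (d_in, d_out). One normalized hop."""
    a = np.eye(17)
    for i, j in EDGES:                        # symmetric adjacency + self-loops
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(1)))
    a_hat = d_inv_sqrt @ a @ d_inv_sqrt       # D^{-1/2} A D^{-1/2}
    return np.maximum(a_hat @ x @ w, 0.0)     # ReLU activation

feats = np.random.rand(17, 2)                 # e.g. 2D joint coordinates
out = gcn_layer(feats, np.random.rand(2, 64))
print(out.shape)                              # (17, 64)
```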