3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation
- URL: http://arxiv.org/abs/2412.02202v1
- Date: Tue, 03 Dec 2024 06:31:25 GMT
- Title: 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation
- Authors: Jinzhi Zhang, Feng Xiong, Mu Xu
- Abstract summary: Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy.
VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
- Score: 4.221298212125194
- Abstract: Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient is the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches, which naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish interconnections by allocating themselves to different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During decoding, a high-resolution triplane converts these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to a 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.
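The coarse-to-fine residual quantization the abstract describes can be illustrated with a short sketch: each scale quantizes only the residual that coarser scales failed to explain, so token counts grow across scales while later tokens carry only fine detail. This is a minimal, generic PyTorch sketch of multi-scale residual quantization, not the authors' implementation; the shared codebook, scale schedule, and pooling choices are all assumptions.

```python
import torch
import torch.nn.functional as F

def residual_quantize_multiscale(z, codebook, scales=(1, 4, 16, 64)):
    """Quantize a latent grid coarse-to-fine: at each scale, quantize the
    current residual on a small token grid, upsample, and subtract, so
    finer scales only encode what coarser scales missed.

    z:        (B, C, H, W) continuous latent (e.g. sampled from a
              Gaussian posterior via the reparameterization trick).
    codebook: (K, C) codebook shared across scales (an assumption).
    Returns per-scale token index maps and the running reconstruction.
    """
    B, C, H, W = z.shape
    residual = z
    recon = torch.zeros_like(z)
    tokens_per_scale = []
    for s in scales:  # s = side length of the token grid at this scale
        r = F.adaptive_avg_pool2d(residual, s)           # (B, C, s, s)
        flat = r.permute(0, 2, 3, 1).reshape(-1, C)      # (B*s*s, C)
        # Standard VQ step: nearest codebook entry per vector.
        idx = torch.cdist(flat, codebook).argmin(dim=1)
        q = codebook[idx].reshape(B, s, s, C).permute(0, 3, 1, 2)
        # Upsample the quantized grid to full resolution and remove
        # what this scale explained from the residual.
        q_up = F.interpolate(q, size=(H, W), mode="bilinear",
                             align_corners=False)
        recon = recon + q_up
        residual = residual - q_up
        tokens_per_scale.append(idx.view(B, s * s))
    return tokens_per_scale, recon
```

An autoregressive model would then predict these per-scale token maps in order, conditioning each scale on the coarser ones, which is the coarse-to-fine modeling the abstract refers to.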
Related papers
- Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers [55.87192133758051]
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency.
We propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios.
arXiv Detail & Related papers (2024-12-22T02:04:17Z)
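The summary gives little mechanistic detail, but "differentiable compression ratios" suggests routing in the spirit of mixture-of-depths: a router scores tokens, only a fraction pass through the expensive sub-block, and a gate gives the routing decision a gradient path. The sketch below is a hedged guess at that general mechanism; the class, its names, and the top-k gating are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

class TokenCompressionLayer(nn.Module):
    """Mixture-of-depths-style token routing: only the top-k tokens
    (k = ratio * N) go through the expensive block; the rest are carried
    unchanged by the residual connection."""
    def __init__(self, dim, block):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.block = block  # e.g. an attention or MLP sub-block

    def forward(self, x, ratio=0.5):
        B, N, D = x.shape                      # x: (B, N, D)
        k = max(1, int(ratio * N))
        scores = self.router(x).squeeze(-1)    # (B, N)
        topk = scores.topk(k, dim=1).indices   # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = torch.gather(x, 1, idx)     # (B, k, D)
        # Scale the processed tokens by their (sigmoided) router score so
        # the router receives gradient -- the "differentiable" part.
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        out = x.clone()
        out.scatter_(1, idx, selected + gate * self.block(selected))
        return out

# Usage with a hypothetical sub-block:
# layer = TokenCompressionLayer(256, nn.Linear(256, 256))
```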
- Attamba: Attending To Multi-Token States [6.5676809841642125]
We introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens.
We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking.
Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling.
arXiv Detail & Related papers (2024-11-26T18:52:06Z)
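Replacing key/value projections with an SSM can be pictured as follows: a recurrence runs over each chunk of tokens and only the final state per chunk is kept as the key/value, so attention cost drops from O(N^2) toward O(N * N/chunk). The toy linear recurrence below stands in for a real state-space model, and causal masking is omitted; everything here is an illustrative assumption, not Attamba's implementation.

```python
import torch
import torch.nn as nn

class ChunkedSSMAttention(nn.Module):
    """Keys/values come from running a (toy) recurrence over each chunk
    of tokens and keeping only its final state, giving one compressed
    key/value per chunk."""
    def __init__(self, dim, chunk=8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.A = nn.Linear(dim, dim, bias=False)  # state transition
        self.B = nn.Linear(dim, dim, bias=False)  # input projection
        self.chunk = chunk

    def forward(self, x):
        # x: (B, N, D); N must be divisible by `chunk` in this sketch.
        Bsz, N, D = x.shape
        xc = x.view(Bsz, N // self.chunk, self.chunk, D)
        state = torch.zeros(Bsz, xc.shape[1], D, device=x.device)
        for t in range(self.chunk):  # recurrence within each chunk
            state = torch.tanh(self.A(state) + self.B(xc[:, :, t]))
        # Full-length queries attend over the per-chunk states only.
        q = self.q(x)                                       # (B, N, D)
        attn = torch.softmax(q @ state.transpose(1, 2) / D**0.5, dim=-1)
        return attn @ state                                 # (B, N, D)
```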
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer [33.97880303341509]
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images.
Our approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38.
HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
arXiv Detail & Related papers (2024-10-14T17:59:42Z)
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
The Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves performance competitive with state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
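The 1D-tokenizer idea can be sketched compactly: a fixed set of learnable latent slots is processed jointly with the patch embeddings, and only the latent slots are kept as the image's token sequence (to be vector-quantized downstream). The sizes, depth, and joint-encoding layout below are illustrative assumptions, not TiTok's exact architecture.

```python
import torch
import torch.nn as nn

class OneDTokenizer(nn.Module):
    """A small set of learnable latent tokens is concatenated with image
    patch embeddings, run through a transformer encoder, and only the
    latent slots are kept as the 1D token sequence."""
    def __init__(self, dim=256, n_latent=32, n_patches=256):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latent, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.n_latent = n_latent

    def forward(self, patch_emb):
        # patch_emb: (B, n_patches, dim) from a patchify + linear stem.
        B = patch_emb.shape[0]
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)
        h = self.encoder(torch.cat([lat, patch_emb], dim=1))
        return h[:, :self.n_latent]  # (B, 32, dim); feed to a VQ layer
```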
- CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting [68.94594215660473]
We propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS).
We exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms.
Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality.
arXiv Detail & Related papers (2024-04-15T04:50:39Z)
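Anchor-plus-residual coding, the mechanism the summary describes, looks roughly like this: full-precision anchors are stored, every other primitive is assigned to a nearby anchor, and only a coarsely quantized residual is kept per primitive. The nearest-neighbor assignment and int8 quantization below are simplifying assumptions, not CompGS's learned predictor or entropy coding.

```python
import torch

def compress_with_anchors(gaussians, anchor_idx):
    """Store each primitive as a small residual from its nearest anchor,
    so only the anchors need full-precision storage.

    gaussians:  (N, D) primitive parameters.
    anchor_idx: indices of the primitives chosen as anchors.
    """
    anchors = gaussians[anchor_idx]                       # (A, D)
    assign = torch.cdist(gaussians, anchors).argmin(1)    # nearest anchor
    residuals = gaussians - anchors[assign]               # near-zero values
    # Residuals quantize well precisely because they are near zero.
    scale = residuals.abs().max().clamp_min(1e-8)
    q = torch.round(residuals / scale * 127).to(torch.int8)
    # Decode with: anchors[assign] + q.float() * scale / 127
    return anchors, assign, q, scale
```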
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to the 3D domain, seeking stronger 3D shape generation by improving their capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
- LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS [55.85673901231235]
We introduce LightGaussian, a method for transforming 3D Gaussians into a more compact format.
Inspired by network pruning, LightGaussian identifies Gaussians with minimal global significance for scene reconstruction.
LightGaussian achieves an average 15x compression rate while boosting FPS from 144 to 237 within the 3D-GS framework.
arXiv Detail & Related papers (2023-11-28T21:39:20Z)
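Pruning by global significance can be sketched as scoring each Gaussian with a cheap importance proxy and keeping only the top fraction. The opacity-times-volume proxy below is an assumption for illustration; the paper scores each Gaussian's contribution to scene reconstruction.

```python
import torch

def prune_gaussians(opacity, scale, keep_ratio=0.34):
    """Keep the most significant Gaussians under a simple proxy score.

    opacity: (N,) per-Gaussian opacity.
    scale:   (N, 3) per-axis extents of each Gaussian.
    Returns indices of Gaussians to retain (and fine-tune afterwards).
    """
    volume = scale.prod(dim=1)
    significance = opacity * volume   # proxy; not the paper's exact score
    k = int(keep_ratio * opacity.numel())
    return significance.topk(k).indices
```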
- Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using Transformers [0.11219061154635457]
Whole-Slide Imaging allows for the capture and digitization of high-resolution images of histological specimens.
The transformer architecture has been proposed as a candidate for effectively leveraging this high-resolution information.
We propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches.
arXiv Detail & Related papers (2023-05-11T16:42:24Z)
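The linear-scaling claim follows from the shape of cross-attention: a constant number of learned queries attends over the N patches, so cost is O(M*N) with fixed M rather than O(N^2) self-attention. The sketch below shows that generic mechanism, not CCAN's specific cascade; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionPool(nn.Module):
    """A fixed set of learned query tokens cross-attends over all patch
    embeddings, so cost grows linearly with the number of patches."""
    def __init__(self, dim=256, n_queries=16, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches):
        # patches: (B, N, dim); N can be tens of thousands for WSIs.
        B = patches.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, patches, patches)  # (B, n_queries, dim)
        return out
```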
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
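Hybrid-scale attention within one layer can be sketched by giving different head groups keys/values pooled at different rates, so some heads see fine tokens and others see coarse ones. The two-group averaging below is an illustrative simplification of the shunted design; the rates and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuntedAttention(nn.Module):
    """Each head group attends over keys/values spatially pooled at its
    own rate, mixing fine and coarse token scales in a single layer."""
    def __init__(self, dim=256, heads=8, rates=(1, 4)):
        super().__init__()
        assert heads % len(rates) == 0
        self.rates = rates
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads // len(rates), batch_first=True)
            for _ in rates)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        # x: (B, N, dim) tokens on an hw[0] x hw[1] grid (N = h * w).
        B, N, D = x.shape
        outs = []
        for rate, attn in zip(self.rates, self.attns):
            kv = x
            if rate > 1:  # pool keys/values spatially by `rate`
                g = x.transpose(1, 2).reshape(B, D, hw[0], hw[1])
                g = F.avg_pool2d(g, rate)
                kv = g.flatten(2).transpose(1, 2)
            out, _ = attn(x, kv, kv)
            outs.append(out)
        # Merge the fine- and coarse-scale head groups, then project.
        return self.proj(torch.stack(outs).mean(0))
```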
- High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
Coarse-grained synthesis of complex scene images via deep generative models has recently gained popularity.
We present an approach to this task, where the generative model is based on pure likelihood training without additional objectives.
We show that the resulting system is able to synthesize high-quality images consistent with the given layouts.
arXiv Detail & Related papers (2021-05-13T17:56:07Z)