3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation
- URL: http://arxiv.org/abs/2412.02202v1
- Date: Tue, 03 Dec 2024 06:31:25 GMT
- Title: 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation
- Authors: Jinzhi Zhang, Feng Xiong, Mu Xu
- Abstract summary: Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy.
VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
- Score: 4.221298212125194
- Abstract: Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient is the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches, which naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish interconnections by allocating themselves to different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During decoding, a high-resolution triplane converts these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to a 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.
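The coarse-to-fine residual quantization the abstract describes can be illustrated with a short sketch: each scale quantizes only the residual that coarser scales failed to explain, so token counts grow across scales while later tokens carry only fine detail. This is a minimal, generic PyTorch sketch of multi-scale residual quantization, not the authors' implementation; the shared codebook, scale schedule, and pooling choices are all assumptions.

```python
import torch
import torch.nn.functional as F

def residual_quantize_multiscale(z, codebook, scales=(1, 4, 16, 64)):
    """Quantize a latent grid coarse-to-fine: at each scale, quantize the
    current residual on a small token grid, upsample, and subtract, so
    finer scales only encode what coarser scales missed.

    z:        (B, C, H, W) continuous latent (e.g. sampled from a
              Gaussian posterior via the reparameterization trick).
    codebook: (K, C) codebook shared across scales (an assumption).
    Returns per-scale token index maps and the running reconstruction.
    """
    B, C, H, W = z.shape
    residual = z
    recon = torch.zeros_like(z)
    tokens_per_scale = []
    for s in scales:  # s = side length of the token grid at this scale
        r = F.adaptive_avg_pool2d(residual, s)           # (B, C, s, s)
        flat = r.permute(0, 2, 3, 1).reshape(-1, C)      # (B*s*s, C)
        # Standard VQ step: nearest codebook entry per vector.
        idx = torch.cdist(flat, codebook).argmin(dim=1)
        q = codebook[idx].reshape(B, s, s, C).permute(0, 3, 1, 2)
        # Upsample the quantized grid to full resolution and remove
        # what this scale explained from the residual.
        q_up = F.interpolate(q, size=(H, W), mode="bilinear",
                             align_corners=False)
        recon = recon + q_up
        residual = residual - q_up
        tokens_per_scale.append(idx.view(B, s * s))
    return tokens_per_scale, recon
```

An autoregressive model would then predict these per-scale token maps in order, conditioning each scale on the coarser ones, which is the coarse-to-fine modeling the abstract refers to.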
Related papers
- Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers [55.87192133758051]
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency.
We propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios.
arXiv Detail & Related papers (2024-12-22T02:04:17Z)
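The summary gives little mechanistic detail, but "differentiable compression ratios" suggests routing in the spirit of mixture-of-depths: a router scores tokens, only a fraction pass through the expensive sub-block, and a gate gives the routing decision a gradient path. The sketch below is a hedged guess at that general mechanism; the class, its names, and the top-k gating are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

class TokenCompressionLayer(nn.Module):
    """Mixture-of-depths-style token routing: only the top-k tokens
    (k = ratio * N) go through the expensive block; the rest are carried
    unchanged by the residual connection."""
    def __init__(self, dim, block):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.block = block  # e.g. an attention or MLP sub-block

    def forward(self, x, ratio=0.5):
        B, N, D = x.shape                      # x: (B, N, D)
        k = max(1, int(ratio * N))
        scores = self.router(x).squeeze(-1)    # (B, N)
        topk = scores.topk(k, dim=1).indices   # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = torch.gather(x, 1, idx)     # (B, k, D)
        # Scale the processed tokens by their (sigmoided) router score so
        # the router receives gradient -- the "differentiable" part.
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        out = x.clone()
        out.scatter_(1, idx, selected + gate * self.block(selected))
        return out

# Usage with a hypothetical sub-block:
# layer = TokenCompressionLayer(256, nn.Linear(256, 256))
```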
- Attamba: Attending To Multi-Token States [6.5676809841642125]
We introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens.
We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking.
Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling.
arXiv Detail & Related papers (2024-11-26T18:52:06Z)
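Replacing key/value projections with an SSM can be pictured as follows: a recurrence runs over each chunk of tokens and only the final state per chunk is kept as the key/value, so attention cost drops from O(N^2) toward O(N * N/chunk). The toy linear recurrence below stands in for a real state-space model, and causal masking is omitted; everything here is an illustrative assumption, not Attamba's implementation.

```python
import torch
import torch.nn as nn

class ChunkedSSMAttention(nn.Module):
    """Keys/values come from running a (toy) recurrence over each chunk
    of tokens and keeping only its final state, giving one compressed
    key/value per chunk."""
    def __init__(self, dim, chunk=8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.A = nn.Linear(dim, dim, bias=False)  # state transition
        self.B = nn.Linear(dim, dim, bias=False)  # input projection
        self.chunk = chunk

    def forward(self, x):
        # x: (B, N, D); N must be divisible by `chunk` in this sketch.
        Bsz, N, D = x.shape
        xc = x.view(Bsz, N // self.chunk, self.chunk, D)
        state = torch.zeros(Bsz, xc.shape[1], D, device=x.device)
        for t in range(self.chunk):  # recurrence within each chunk
            state = torch.tanh(self.A(state) + self.B(xc[:, :, t]))
        # Full-length queries attend over the per-chunk states only.
        q = self.q(x)                                       # (B, N, D)
        attn = torch.softmax(q @ state.transpose(1, 2) / D**0.5, dim=-1)
        return attn @ state                                 # (B, N, D)
```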
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer [33.97880303341509]
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images.
Our approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38.
HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
arXiv Detail & Related papers (2024-10-14T17:59:42Z)
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
The Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves performance competitive with state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
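The 1D-tokenizer idea can be sketched compactly: a fixed set of learnable latent slots is processed jointly with the patch embeddings, and only the latent slots are kept as the image's token sequence (to be vector-quantized downstream). The sizes, depth, and joint-encoding layout below are illustrative assumptions, not TiTok's exact architecture.

```python
import torch
import torch.nn as nn

class OneDTokenizer(nn.Module):
    """A small set of learnable latent tokens is concatenated with image
    patch embeddings, run through a transformer encoder, and only the
    latent slots are kept as the 1D token sequence."""
    def __init__(self, dim=256, n_latent=32, n_patches=256):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latent, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.n_latent = n_latent

    def forward(self, patch_emb):
        # patch_emb: (B, n_patches, dim) from a patchify + linear stem.
        B = patch_emb.shape[0]
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)
        h = self.encoder(torch.cat([lat, patch_emb], dim=1))
        return h[:, :self.n_latent]  # (B, 32, dim); feed to a VQ layer
```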
- CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting [68.94594215660473]
We propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS).
We exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms.
Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality.
arXiv Detail & Related papers (2024-04-15T04:50:39Z)
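Anchor-plus-residual coding, the mechanism the summary describes, looks roughly like this: full-precision anchors are stored, every other primitive is assigned to a nearby anchor, and only a coarsely quantized residual is kept per primitive. The nearest-neighbor assignment and int8 quantization below are simplifying assumptions, not CompGS's learned predictor or entropy coding.

```python
import torch

def compress_with_anchors(gaussians, anchor_idx):
    """Store each primitive as a small residual from its nearest anchor,
    so only the anchors need full-precision storage.

    gaussians:  (N, D) primitive parameters.
    anchor_idx: indices of the primitives chosen as anchors.
    """
    anchors = gaussians[anchor_idx]                       # (A, D)
    assign = torch.cdist(gaussians, anchors).argmin(1)    # nearest anchor
    residuals = gaussians - anchors[assign]               # near-zero values
    # Residuals quantize well precisely because they are near zero.
    scale = residuals.abs().max().clamp_min(1e-8)
    q = torch.round(residuals / scale * 127).to(torch.int8)
    # Decode with: anchors[assign] + q.float() * scale / 127
    return anchors, assign, q, scale
```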
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to the 3D domain, seeking stronger 3D shape generation by improving their capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
- LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS [55.85673901231235]
We introduce LightGaussian, a method for transforming 3D Gaussians into a more compact format.
Inspired by network pruning, LightGaussian identifies Gaussians with minimal global significance for scene reconstruction.
LightGaussian achieves an average 15x compression rate while boosting FPS from 144 to 237 within the 3D-GS framework.
arXiv Detail & Related papers (2023-11-28T21:39:20Z)
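Pruning by global significance can be sketched as scoring each Gaussian with a cheap importance proxy and keeping only the top fraction. The opacity-times-volume proxy below is an assumption for illustration; the paper scores each Gaussian's contribution to scene reconstruction.

```python
import torch

def prune_gaussians(opacity, scale, keep_ratio=0.34):
    """Keep the most significant Gaussians under a simple proxy score.

    opacity: (N,) per-Gaussian opacity.
    scale:   (N, 3) per-axis extents of each Gaussian.
    Returns indices of Gaussians to retain (and fine-tune afterwards).
    """
    volume = scale.prod(dim=1)
    significance = opacity * volume   # proxy; not the paper's exact score
    k = int(keep_ratio * opacity.numel())
    return significance.topk(k).indices
```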
- Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using Transformers [0.11219061154635457]
Whole-Slide Imaging allows for the capture and digitization of high-resolution images of histological specimens.
The transformer architecture has been proposed as a candidate for effectively leveraging this high-resolution information.
We propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches.
arXiv Detail & Related papers (2023-05-11T16:42:24Z)
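The linear-scaling claim follows from the shape of cross-attention: a constant number of learned queries attends over the N patches, so cost is O(M*N) with fixed M rather than O(N^2) self-attention. The sketch below shows that generic mechanism, not CCAN's specific cascade; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionPool(nn.Module):
    """A fixed set of learned query tokens cross-attends over all patch
    embeddings, so cost grows linearly with the number of patches."""
    def __init__(self, dim=256, n_queries=16, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches):
        # patches: (B, N, dim); N can be tens of thousands for WSIs.
        B = patches.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, patches, patches)  # (B, n_queries, dim)
        return out
```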
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
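Hybrid-scale attention within one layer can be sketched by giving different head groups keys/values pooled at different rates, so some heads see fine tokens and others see coarse ones. The two-group averaging below is an illustrative simplification of the shunted design; the rates and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuntedAttention(nn.Module):
    """Each head group attends over keys/values spatially pooled at its
    own rate, mixing fine and coarse token scales in a single layer."""
    def __init__(self, dim=256, heads=8, rates=(1, 4)):
        super().__init__()
        assert heads % len(rates) == 0
        self.rates = rates
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads // len(rates), batch_first=True)
            for _ in rates)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        # x: (B, N, dim) tokens on an hw[0] x hw[1] grid (N = h * w).
        B, N, D = x.shape
        outs = []
        for rate, attn in zip(self.rates, self.attns):
            kv = x
            if rate > 1:  # pool keys/values spatially by `rate`
                g = x.transpose(1, 2).reshape(B, D, hw[0], hw[1])
                g = F.avg_pool2d(g, rate)
                kv = g.flatten(2).transpose(1, 2)
            out, _ = attn(x, kv, kv)
            outs.append(out)
        # Merge the fine- and coarse-scale head groups, then project.
        return self.proj(torch.stack(outs).mean(0))
```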
- High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
Coarse-grained synthesis of complex scene images via deep generative models has recently gained popularity.
We present an approach to this task, where the generative model is based on pure likelihood training without additional objectives.
We show that the resulting system is able to synthesize high-quality images consistent with the given layouts.
arXiv Detail & Related papers (2021-05-13T17:56:07Z)