WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
- URL: http://arxiv.org/abs/2508.05599v1
- Date: Thu, 07 Aug 2025 17:41:26 GMT
- Title: WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
- Authors: Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Ying Zhang, Chen Li, Yali Wang
- Abstract summary: We introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers. We partition the latent features into groups, and perform lookup-free quantization for each group. Generative Decoding (GD) probabilistically models the distribution of visual data conditioned on discrete tokens.
- Score: 15.687542914511488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with an extra noise variable as prior. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384, i.e., only half of ours). Code and models are available: https://github.com/zhuangshaobin/WeTok.
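To make the two innovations concrete, below is a minimal PyTorch sketch of group-wise lookup-free quantization in the spirit of GQ. It assumes sign-based binarization per channel with a straight-through estimator, as in lookup-free quantization; the function name, tensor shapes, and binarization rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def groupwise_lookup_free_quantize(z: torch.Tensor, num_groups: int):
    """Sketch of group-wise lookup-free quantization (GQ).

    Assumption: each latent channel is binarized to {-1, +1}, so a group
    of g channels addresses an implicit codebook of size 2**g without
    storing or searching an embedding table.

    z: latent features of shape (B, C, H, W), split into `num_groups`
    groups of g = C // num_groups channels each.
    """
    B, C, H, W = z.shape
    assert C % num_groups == 0, "channels must split evenly into groups"
    g = C // num_groups

    # Element-wise sign quantization; the straight-through estimator
    # lets gradients bypass the non-differentiable rounding.
    hard = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    z_q = z + (hard - z).detach()

    # Read each group's g bits as an integer token in [0, 2**g).
    bits = (hard > 0).long().reshape(B, num_groups, g, H, W)
    weights = (2 ** torch.arange(g, device=z.device)).view(1, 1, g, 1, 1)
    tokens = (bits * weights).sum(dim=2)  # (B, num_groups, H, W)
    return z_q, tokens
```

For example, with C = 12 channels and num_groups = 4, each spatial position yields 4 tokens from an implicit codebook of 2**3 = 8 entries; growing the group size scales the codebook exponentially without any lookup table, which is the memory and computation advantage the abstract attributes to GQ. Likewise, a hypothetical stub of the Generative Decoding idea: the decoder receives the quantized latents together with a fresh noise variable, so decoding samples from a conditional distribution instead of producing a single deterministic reconstruction. The layer sizes and noise-injection scheme below are placeholders.

```python
class GenerativeDecoderStub(torch.nn.Module):
    """Hypothetical GD-style decoder: pixels are generated conditioned on
    quantized latents plus an extra noise variable (shapes illustrative)."""

    def __init__(self, c_latent: int, c_noise: int = 16):
        super().__init__()
        self.c_noise = c_noise
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(c_latent + c_noise, 64, 3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(64, 3, 3, padding=1),  # RGB output
        )

    def forward(self, z_q: torch.Tensor) -> torch.Tensor:
        # A fresh noise sample per call makes decoding stochastic.
        noise = torch.randn(z_q.size(0), self.c_noise, z_q.size(2),
                            z_q.size(3), device=z_q.device)
        return self.net(torch.cat([z_q, noise], dim=1))
```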
Related papers
- AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model [59.065471969232284]
We propose a novel Aligned Tokenizer (AliTok) to align the tokenizer and the autoregressive model. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed.
arXiv Detail & Related papers (2025-06-05T17:45:10Z)
- CODA: Repurposing Continuous VAEs for Discrete Tokenization [52.58960429582813]
CODA (COntinuous-to-Discrete Adaptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and a notable reconstruction FID (rFID) of 0.43 and 1.34 for 8× and 16× compression on the ImageNet 256×256 benchmark.
arXiv Detail & Related papers (2025-03-22T12:59:00Z)
- Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis [57.7367843129838]
Recent image generation schemes typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. We propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction.
arXiv Detail & Related papers (2025-03-11T12:09:11Z)
- UniTok: A Unified Tokenizer for Visual Generation and Understanding [69.09699034036124]
Visual generative and understanding models typically rely on distinct tokenizers to process images. We introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism (see the sketch after this list). In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet.
arXiv Detail & Related papers (2025-02-27T17:47:01Z)
- GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting [64.84383010238908]
We propose an effective image tokenizer with 2D Gaussian Splatting as a solution. In general, our framework integrates the local influence of the 2D Gaussian distribution into the discrete space. Competitive reconstruction performance on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrates the effectiveness of our framework.
arXiv Detail & Related papers (2025-01-26T17:56:11Z)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
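As referenced in the UniTok entry above, multi-codebook quantization can be pictured as product-quantization-style splitting: the latent vector is chunked along the channel axis and each chunk is snapped to the nearest entry in its own small codebook, so the effective vocabulary is the product of the sub-codebook sizes. This is a hedged illustration; the function name, shapes, and nearest-neighbor rule are assumptions, not UniTok's released code.

```python
import torch

def multi_codebook_quantize(z: torch.Tensor, codebooks):
    """Hypothetical multi-codebook quantization in the spirit of UniTok.

    z: (B, N, C) latent vectors. codebooks: list of (V_i, C_i) tensors
    with sum(C_i) == C; chunk i of the channel axis is matched to its
    nearest entry in codebook i.
    """
    out, tokens, offset = [], [], 0
    for cb in codebooks:
        V, Ci = cb.shape
        zi = z[..., offset:offset + Ci]                # (B, N, Ci)
        # Squared Euclidean distances to every codebook entry.
        d = ((zi.unsqueeze(2) - cb) ** 2).sum(dim=-1)  # (B, N, V)
        idx = d.argmin(dim=-1)                         # (B, N)
        out.append(cb[idx])                            # quantized chunk
        tokens.append(idx)
        offset += Ci
    # Concatenated chunks form the full quantized vector; one token per
    # sub-codebook per position.
    return torch.cat(out, dim=-1), torch.stack(tokens, dim=-1)
```

With, say, four sub-codebooks of 4,096 entries each, the joint vocabulary covers 4096**4 combinations per position while each nearest-neighbor search stays small, which is the usual motivation for splitting one large codebook into several.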
This list is automatically generated from the titles and abstracts of the papers on this site.