Fugu-MT Paper Translation (Abstract): WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Paper abstract: WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

arxiv url: http://arxiv.org/abs/2508.05599v1
Date: Thu, 07 Aug 2025 17:41:26 GMT
Status: Translation complete
Last updated in system: 2025-08-08 18:59:39.972229
Title: WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Title (reference translation): WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Authors: Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Ying Zhang, Chen Li, Yali Wang
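Abstract summary: The WeTok tokenizer is a powerful and concise tokenizer that surpasses the previous leading tokenizers. It partitions the latent features into groups and performs lookup-free quantization for each group. Generative Decoding (GD) can probabilistically model the distribution of visual data conditioned on discrete tokens.
Reference score (independently computed attention metric): 15.687542914511488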
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior of an extra noise variable. GD can thus probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (compression ratio 384, rFID 4.57), which has only 50% of our compression ratio. Code and models are available: https://github.com/zhuangshaobin/WeTok.
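Abstract (reference translation): A visual tokenizer is an important component for visual generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce the powerful and concise WeTok tokenizer, which outperforms the previous leading tokenizers through two core innovations. 1) Group-wise lookup-free Quantization (GQ). The latent features are partitioned into groups, and lookup-free quantization is performed for each group. As a result, GQ can efficiently overcome the memory and computation limitations of prior tokenizers while achieving a reconstruction breakthrough with more scalable codebooks. 2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior of an extra noise variable. In this case, GD probabilistically models the distribution of visual data conditioned on discrete tokens, which lets WeTok reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks demonstrate the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, the highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (compression ratio 384, rFID 4.57), which has only 50% of that compression ratio. Code and models are available at https://github.com/zhuangshaobin/WeTok.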
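
The group-wise lookup-free quantization (GQ) step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `group_lookup_free_quantize`, the sign-based binarization of each channel, and the bit ordering used to form a token id are all assumptions made for illustration, since the abstract only states that the latent features are partitioned into groups and quantized without a codebook lookup.

```python
from typing import List, Tuple


def group_lookup_free_quantize(
    latent: List[float], num_groups: int
) -> Tuple[List[float], List[int]]:
    """Quantize one latent vector group by group without a stored codebook.

    Assumption: each channel is binarized by its sign (-1.0 or +1.0), as in
    sign-based lookup-free quantization. The abstract does not give the
    exact rule, so treat this as an illustration only.
    """
    if num_groups <= 0 or len(latent) % num_groups != 0:
        raise ValueError("latent length must be divisible by num_groups")
    group_size = len(latent) // num_groups

    quantized: List[float] = []
    token_ids: List[int] = []
    for g in range(num_groups):
        group = latent[g * group_size:(g + 1) * group_size]
        # Binarize each channel: positive -> bit 1, otherwise bit 0.
        bits = [1 if value > 0 else 0 for value in group]
        quantized.extend(1.0 if bit else -1.0 for bit in bits)
        # The group's token id is its bit pattern read as a binary number
        # (the first channel is the least significant bit).
        token_ids.append(sum(bit << i for i, bit in enumerate(bits)))
    return quantized, token_ids


# Hypothetical 6-channel latent split into 2 groups of 3 channels each.
latent = [0.7, -1.2, 0.1, -0.3, 2.0, 0.5]
quantized, token_ids = group_lookup_free_quantize(latent, num_groups=2)
print(quantized)
print(token_ids)
```

In this toy setting each group of 3 channels implicitly has 2^3 = 8 codes, so the two groups together cover 8 × 8 = 64 combinations without any codebook table being stored or searched.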

Related papers