Fugu-MT Paper Translation (Abstract): Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

Paper abstract: Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

arxiv url: http://arxiv.org/abs/2605.06207v1
Date: Thu, 07 May 2026 13:13:30 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.818916
Title: Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Title (reference translation): Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Authors: Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu,
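Abstract summary: The per-position conditional entropy of the training set decays so rapidly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. To address this, we propose Variable Codebook Size Quantization (VCQ), in which the codebook size $K_t$ grows monotonically along the sequence. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet.
Reference score (independently calculated attention metric): 21.427403915969872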
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
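As a quick sanity check of the Entropy Cliff expression $t^{*} = \lceil \log_2 N / \log_2 K \rceil$, the sketch below evaluates it for the $K=16384$ case quoted in the abstract. The abstract does not state $N$; the value used here (1,281,167, the size of the standard ImageNet-1k training split) is an assumption, and the function name is purely illustrative.

```python
import math


def entropy_cliff_position(num_train_samples: int, codebook_size: int) -> int:
    """Evaluate t* = ceil(log2(N) / log2(K)) from the abstract.

    num_train_samples is N (training-set size) and codebook_size is K.
    """
    if num_train_samples < 1 or codebook_size < 2:
        raise ValueError("need N >= 1 and K >= 2")
    return math.ceil(math.log2(num_train_samples) / math.log2(codebook_size))


# Assumption: N is the ImageNet-1k training-set size; the abstract gives only K.
IMAGENET_TRAIN_SIZE = 1_281_167

t_star = entropy_cliff_position(IMAGENET_TRAIN_SIZE, 16384)
print(t_star)  # 2, consistent with "within only 2 out of 256 positions"
```

With $\log_2 N \approx 20.3$ bits and $\log_2 K = 14$ bits per position, two positions already offer 28 bits, which is enough to index every training image uniquely.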
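Abstract (reference translation): Most discrete visual tokenizers rely on a default design in which every position in the sequence shares the same codebook. Researchers have tried to scale up the codebook size $K$ to improve reconstruction performance. Such a constant-codebook design runs into a fundamental information-theoretic limit. The per-position conditional entropy of the training set decays so rapidly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 of the 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, whose natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), in which the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, while the loss function, parameter count, and AR training procedure remain unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ relative to the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared with 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.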
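To make the idea of a growing codebook size concrete, here is a minimal sketch of one possible schedule. The abstract only says that $K_t$ grows monotonically from $K_{\min}=2$ to $K_{\max}$; the geometric interpolation below, the choice $K_{\max}=16384$, the value of $N$ (ImageNet-1k training-set size), and both function names are assumptions for illustration, not the authors' schedule. The saturation check also generalizes the constant-$K$ expression by accumulating $\log_2 K_t$ over positions, which is an interpretation rather than something the abstract states.

```python
import math


def geometric_codebook_schedule(seq_len: int, k_min: int = 2, k_max: int = 16384) -> list[int]:
    """Hypothetical schedule: log2(K_t) rises linearly from log2(k_min) to log2(k_max)."""
    lo, hi = math.log2(k_min), math.log2(k_max)
    sizes = []
    for t in range(seq_len):
        frac = t / (seq_len - 1) if seq_len > 1 else 0.0
        sizes.append(round(2 ** (lo + frac * (hi - lo))))
    return sizes


def first_saturated_position(sizes: list[int], num_train_samples: int):
    """Smallest 1-indexed t whose cumulative capacity sum(log2 K_i) reaches log2(N).

    Returns None if the sequence never reaches that capacity.
    """
    target_bits = math.log2(num_train_samples)
    bits = 0.0
    for t, k in enumerate(sizes, start=1):
        bits += math.log2(k)
        if bits >= target_bits:
            return t
    return None


N = 1_281_167  # assumption: ImageNet-1k training-set size
uniform = [16384] * 256
variable = geometric_codebook_schedule(256)

print(first_saturated_position(uniform, N))   # matches the constant-K formula
print(first_saturated_position(variable, N))  # later than the uniform case
```

Under these assumptions the variable schedule spends only one bit on the first token, so the cumulative capacity reaches $\log_2 N$ later in the sequence than with a uniform codebook.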

Related papers