Fugu-MT Paper Translation (Abstract): GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Paper overview: GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2504.08736v1
Date: Fri, 11 Apr 2025 17:59:58 GMT
Status: Translation complete
Last updated in system: 2025-04-21 15:52:10.39433
Title: GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu
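Abstract summary: We introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of the latent space as the key factor behind the reconstruction vs. generation dilemma. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
Reference score (attention score computed independently by the site): 62.77721499671665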
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
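The abstract describes semantic regularization only at a high level: tokenizer features are aligned with features from a pre-trained visual encoder. The sketch below illustrates one plausible form of such an alignment term, a mean (1 - cosine similarity) penalty added to a reconstruction loss. The exact loss used by GigaTok is not stated in the abstract, so the cosine form, the function names (`semantic_alignment_loss`, `total_loss`), and the `weight` parameter are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def semantic_alignment_loss(tokenizer_feats, encoder_feats):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Assumption: both feature lists are already projected to the same
    dimension, and encoder_feats come from a frozen pre-trained encoder.
    """
    if not tokenizer_feats or len(tokenizer_feats) != len(encoder_feats):
        raise ValueError("feature lists must be non-empty and equal in length")
    sims = [cosine_similarity(t, e) for t, e in zip(tokenizer_feats, encoder_feats)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(reconstruction_loss, tokenizer_feats, encoder_feats, weight=0.5):
    """Reconstruction loss plus a weighted alignment term (hypothetical weighting)."""
    return reconstruction_loss + weight * semantic_alignment_loss(
        tokenizer_feats, encoder_feats
    )
```

Under this assumed form, the alignment term is near 0 when the tokenizer features point in the same direction as the encoder features and grows toward 2 as they point in opposite directions, so minimizing it pulls the tokenizer's latent features toward the encoder's semantic structure.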


Related papers