Fugu-MT Paper Translation (Abstract): AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

Paper summary: AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

arxiv url: http://arxiv.org/abs/2606.07185v1
Date: Fri, 05 Jun 2026 11:49:28 GMT
Status: Translation complete
System update date: 2026-06-08 14:33:29.720846
Title: AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens
Authors: Xiaocheng Lu, Yuxi Chen, Jie Zhang, Jian Liu, Jingcai Guo, Fangqi Zhu, Tao Han, Song Guo
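Abstract summary: The paper proposes AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking, with Adaptive Token Allocation. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using ~118 tokens on average.
Reference score (attention score computed independently by the site): 39.0104982235623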
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer itself. In this work, we ask whether a discrete visual tokenizer can budget itself in one pass. Our central finding is that actionable elasticity requires a representation-allocation co-design: prefixes must remain decodable across budgets, and the tokenizer must learn which prefix each image needs. We propose AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking and resolves budget-dependent semantic shift through Multi-Head LoRA decoder heads, with Adaptive Token Allocation, which trains a lightweight deterministic-group GRPO policy over candidate budgets. Dynamic Pareto Weighting balances fidelity and efficiency during policy training without manual trade-off sweeps. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using only ~118 tokens on average, outperforming discrete 1D baselines at comparable budgets. In autoregressive image generation, the shorter adaptive representation yields ~2.1x throughput over a fixed 256-token decode, suggesting that visual token count can be learned as a content-conditioned output rather than set as a fixed hyperparameter.
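The abstract's idea of prefixes that "remain decodable across budgets" can be illustrated with a minimal sketch of tail masking over a 1D token sequence. This is not the authors' implementation: the names `MASK_ID`, `tail_mask`, `nested_views`, and `token_ratio` are hypothetical, and the abstract does not say how masked slots are represented. The last helper is only a back-of-envelope check of the reported numbers, assuming decode cost grows roughly in proportion to token count.

```python
from typing import Dict, Iterable, List, Sequence

MASK_ID = -1  # hypothetical placeholder id for a masked slot (an assumption)


def tail_mask(tokens: Sequence[int], budget: int, mask_id: int = MASK_ID) -> List[int]:
    """Keep the first `budget` tokens and replace the rest with `mask_id`."""
    if not 0 <= budget <= len(tokens):
        raise ValueError("budget must be between 0 and len(tokens)")
    return list(tokens[:budget]) + [mask_id] * (len(tokens) - budget)


def nested_views(tokens: Sequence[int], budgets: Iterable[int]) -> Dict[int, List[int]]:
    """Build one tail-masked view per candidate budget, in increasing order.

    Each shorter view is a prefix of every longer one, which is the
    nesting property the abstract describes.
    """
    return {b: tail_mask(tokens, b) for b in sorted(budgets)}


def token_ratio(full_budget: int, mean_adaptive_budget: float) -> float:
    """Ratio of token counts between a fixed and an adaptive budget."""
    return full_budget / mean_adaptive_budget


if __name__ == "__main__":
    views = nested_views(list(range(8)), budgets=[2, 4, 8])
    for budget, view in views.items():
        print(budget, view)
    # 256 tokens versus ~118 on average, the figures quoted in the abstract
    print(round(token_ratio(256, 118), 2))  # 2.17
```

The token-count ratio 256/118 is about 2.17, which is in line with the ~2.1x throughput figure the abstract reports for autoregressive generation, though the measured speedup depends on factors this sketch does not model.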


Related papers