Fugu-MT Paper Translation (Abstract): HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

Paper abstract: HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

arxiv url: http://arxiv.org/abs/2509.23736v1
Date: Sun, 28 Sep 2025 08:30:26 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.411312
Title: HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
Title (reference translation): HieraTok: A Multi-Scale Visual Tokenizer That Improves Image Reconstruction and Generation
Authors: Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen
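Abstract summary: HieraTok is a novel multi-scale Vision Transformer (ViT)-based tokenizer. By combining multi-scale downsampling with scale-causal attention, HieraTok achieves significant improvements in both image reconstruction and generation tasks.
Reference score (independently calculated attention score): 77.92119705470284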
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart with a 27.2\% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9\% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art (SOTA) rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
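The two designs named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the average pooling, the scale list `[1, 2, 4, 8, 16]`, bidirectional attention within a scale, and the function names `multiscale_tokens`, `scale_causal_mask`, and `masked_attention` are all assumptions made for illustration. The abstract states only that the token map is downsampled at multiple scales and that information flows from low-resolution to high-resolution tokens.

```python
import numpy as np


def multiscale_tokens(token_map, scales):
    """Pool an (H, W, C) token map to an s x s grid for each s in `scales`.

    Assumption: average pooling, scales ordered coarse to fine.
    Returns the concatenated tokens and the scale index of each token.
    """
    h, w, c = token_map.shape
    tokens, scale_ids = [], []
    for idx, s in enumerate(scales):
        assert h % s == 0 and w % s == 0, "each scale must divide the map size"
        # Split the map into s x s blocks and average inside each block.
        pooled = token_map.reshape(s, h // s, s, w // s, c).mean(axis=(1, 3))
        tokens.append(pooled.reshape(s * s, c))
        scale_ids.append(np.full(s * s, idx))
    return np.concatenate(tokens), np.concatenate(scale_ids)


def scale_causal_mask(scale_ids):
    """mask[i, j] is True when query i may attend to key j.

    Assumption: a token sees its own scale and every coarser scale,
    never a finer one.
    """
    return scale_ids[:, None] >= scale_ids[None, :]


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
token_map = rng.normal(size=(16, 16, 8))  # hypothetical encoder output
tokens, scale_ids = multiscale_tokens(token_map, scales=[1, 2, 4, 8, 16])
mask = scale_causal_mask(scale_ids)
out = masked_attention(tokens, tokens, tokens, mask)
```

Under these assumptions the single coarsest token attends only to itself, while the finest-scale tokens attend to every token in the sequence, which is one way to realize the coarse-to-fine flow the abstract describes.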
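Abstract (reference translation): In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the limitation inherent in modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, and (2) a scale-causal attention mechanism that enables information to flow progressively from low-resolution global semantic features to high-resolution structural details. Combining these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart with a 27.2\% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a faster convergence rate and an 18.9\% boost in gFID ($16.4 \rightarrow 13.3$). Furthermore, by scaling up the tokenizer's training, we show a SOTA rFID of 0.45 and a gFID of 1.82. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope this work advances ViT-based tokenizers in visual generation tasks.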
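The percentage figures quoted in the abstract follow from the FID pairs if they are read as relative reductions of a lower-is-better metric. The short check below assumes that reading; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (before - after) / before


rfid_gain = relative_reduction(1.47, 1.07)  # reconstruction FID pair from the abstract
gfid_gain = relative_reduction(16.4, 13.3)  # generation FID pair from the abstract
print(f"rFID: {rfid_gain:.1%}, gFID: {gfid_gain:.1%}")
# prints "rFID: 27.2%, gFID: 18.9%"
```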

Related papers