Fugu-MT Paper Translation (Abstract): Image Tokenizer Needs Post-Training

Paper overview: Image Tokenizer Needs Post-Training

arxiv url: http://arxiv.org/abs/2509.12474v1
Date: Mon, 15 Sep 2025 21:38:03 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:52.778195
Title: Image Tokenizer Needs Post-Training
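Title (reference translation): Image Tokenizer Needs Post-Training
Authors: Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Abstract summary: This paper proposes a novel tokenizer training scheme focused on latent space construction and decoding. Specifically, it proposes a plug-and-play tokenizer training scheme that significantly improves tokenizer robustness. To mitigate the distribution difference between generated and reconstructed tokens, the tokenizer decoder is further optimized with respect to a well-trained generative model.
Reference score (independently computed attention level): 76.91832192778732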
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distributions: current tokenizers prioritize only the reconstruction task, which happens before generative training, without considering the generation errors that arise during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space and, building on this analysis, propose a novel tokenizer training scheme including both main training and post-training, which focus on improving latent space construction and decoding, respectively. During main training, a latent perturbation strategy is proposed to simulate sampling noise, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
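Abstract (reference translation): Recent image generative models typically rely on a frozen image tokenizer to capture the image distribution in a pre-constructed latent space. However, current tokenizers prioritize only the reconstruction task, which takes place before generative training, and do not account for generation errors during sampling. This paper comprehensively analyzes the cause of the resulting discrepancy between reconstruction and generation in a discrete latent space and, based on that analysis, proposes a novel tokenizer training scheme comprising main training and post-training, which focus on latent space construction and decoding, respectively. For main training, a latent perturbation strategy is proposed that simulates sampling noise. Specifically, the paper proposes a plug-and-play tokenizer training scheme that significantly improves tokenizer robustness, and thereby generation quality and convergence speed, together with a novel tokenizer evaluation metric, pFID, that correlates tokenizer performance with generation quality. During post-training, the tokenizer decoder is further optimized with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with the proposed main training achieves a notable 1.60 gFID, and obtains 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of the post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.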
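
The latent perturbation idea described in the abstract (simulating the unexpected tokens a generator may produce at sampling time) can be sketched roughly as follows. This is a minimal illustration only: the abstract does not specify how tokens are perturbed, so the uniform random replacement, the function name `perturb_tokens`, and the `rate` and `codebook_size` parameters are all assumptions, not the authors' method.

```python
import random


def perturb_tokens(tokens, codebook_size, rate, rng=None):
    """Return a copy of `tokens` with some indices replaced at random.

    Hypothetical sketch: each position is independently replaced, with
    probability `rate`, by an index drawn uniformly from the codebook.
    Uniform replacement is an assumption made for illustration; the
    paper's actual perturbation rule is not given in the abstract.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if rng is None:
        rng = random.Random()

    perturbed = list(tokens)
    for i in range(len(perturbed)):
        if rng.random() < rate:
            # Simulate an "unexpected" token produced during sampling.
            perturbed[i] = rng.randrange(codebook_size)
    return perturbed


# Example: perturb about a quarter of a short discrete token sequence.
clean = [3, 17, 42, 8, 99, 5, 61, 12]
noisy = perturb_tokens(clean, codebook_size=1024, rate=0.25, rng=random.Random(0))
print(noisy)
```

In a training loop, the decoder would presumably be asked to reconstruct the image from `noisy` rather than `clean`, so that it learns to tolerate token errors; how the perturbation strength is chosen or scheduled is not stated in the abstract.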
