Fugu-MT Paper Translation (Abstract): InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Paper overview: InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2605.14333v1
Date: Thu, 14 May 2026 03:57:25 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.610862
Title: InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
Title (reference translation): InsightTok: Improving text and face fidelity in discrete tokenization for autoregressive image generation
Authors: Yang Yue, Fangyun Wei, Tianyu He, Jinjing Zhao, Zanlin Ni, Zeyu Liu, Jiayi Guo, Lei Shi, Yue Dong, Li Chen, Ji Li, Gao Huang, Dong Chen
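Abstract summary: InsightTok is a discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction. The results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
Reference score (independently computed attention score): 67.8525902443746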
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
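Abstract (reference translation): Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain difficult for autoregressive generators built on discrete tokenization. Aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. The authors attribute this gap to standard discrete-tokenizer objectives, which are weakly aligned with text legibility and facial fidelity because they optimize generic reconstruction while compressing diverse content uniformly. To address this, they propose InsightTok, a simple yet effective discrete visual tokenization framework for enhancing text and face fidelity. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains transfer consistently to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, the results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.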

Related papers