Fugu-MT Paper Translation (Abstract): TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Paper overview: TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

arxiv url: http://arxiv.org/abs/2604.07340v1
Date: Wed, 08 Apr 2026 17:53:52 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.668433
Title: TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
Authors: Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang
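Abstract summary: We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Token-to-latent compression is decomposed into two stages, reducing structural information loss. The semantic structure of image tokens is enhanced through joint self-supervised training, leading to more generation-friendly latents.
Reference score (independently calculated attention score): 51.71228803075235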
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generation-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
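The token-scaling idea in the abstract can be made concrete with a small arithmetic sketch. This is not the authors' implementation. It assumes a square image, a ViT that emits one token per non-overlapping patch, and a latent grid whose size is fixed by a spatial compression factor. The function names and the example values (a 256-pixel image, 32x spatial compression, patch sizes 32, 16, and 8) are illustrative assumptions, not figures from the paper. The sketch shows that shrinking the patch size raises the token count while the latent grid stays fixed, so more tokens have to be merged into each latent position.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens for a square image cut into non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def latent_positions(image_size: int, compression: int) -> int:
    """Number of spatial positions in the latent grid (hypothetical fixed budget)."""
    if image_size % compression != 0:
        raise ValueError("image_size must be divisible by compression")
    side = image_size // compression
    return side * side


def tokens_per_latent(image_size: int, patch_size: int, compression: int) -> float:
    """How many tokens are squeezed into each latent position."""
    return token_count(image_size, patch_size) / latent_positions(image_size, compression)


if __name__ == "__main__":
    image_size, compression = 256, 32  # assumed example values
    for patch_size in (32, 16, 8):
        tokens = token_count(image_size, patch_size)
        ratio = tokens_per_latent(image_size, patch_size, compression)
        print(f"patch={patch_size:2d} tokens={tokens:4d} tokens per latent={ratio:.1f}")
```

Under these assumed values, the ratio grows from 1 to 4 to 16 tokens per latent position as the patch size halves. That growing ratio is the kind of aggressive token-to-latent compression the abstract identifies as the limiting factor, and which the paper proposes to split into two stages.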


Related papers