Fugu-MT Paper Translation (Abstract): IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Paper overview: IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

arxiv url: http://arxiv.org/abs/2606.11096v1
Date: Tue, 09 Jun 2026 16:53:30 GMT
Status: Translation complete
System update date: 2026-06-10 15:40:58.62342
Title: IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
Authors: Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong, Yixuan Ren, Bo He, Yu-Gang Jiang, Zuxuan Wu
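Abstract summary: The authors propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By aligning quantized tokens with both shallow and deep VFM features, the resulting discrete visual tokens preserve both visual fidelity and rich semantics.
Reference score (independently computed attention score): 74.25043153401586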
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.
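The joint alignment idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, the quantizer, or any projection heads, so everything below is an assumption made for illustration: nearest-neighbour vector quantization, a cosine-distance alignment term, equal feature dimensions for tokens and VFM features, and the hypothetical names `quantize`, `joint_alignment_loss`, `w_shallow`, and `w_deep`. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def quantize(z, codebook):
    """Assumed quantizer: pick the codebook entry nearest to z
    (squared Euclidean distance). Returns (index, code vector)."""
    idx = min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
    )
    return idx, codebook[idx]


def joint_alignment_loss(tokens, shallow_feats, deep_feats,
                         w_shallow=1.0, w_deep=1.0):
    """Assumed loss: average over tokens of the weighted cosine distance
    to a shallow-layer target plus the weighted cosine distance to a
    deep-layer target. The weights are hypothetical hyperparameters."""
    total = 0.0
    for q, s, d in zip(tokens, shallow_feats, deep_feats):
        total += w_shallow * (1.0 - cosine(q, s))
        total += w_deep * (1.0 - cosine(q, d))
    return total / len(tokens)


# Toy usage with 2-D vectors; real VFM features would be much larger.
codebook = [[1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
loss = joint_alignment_loss([code], [[1.0, 0.0]], [[0.0, 1.0]])
```

In this toy case the quantized token matches the shallow target exactly and is orthogonal to the deep target, so only the deep term contributes to the loss. A practical system would more likely map tokens into each feature space through learned heads before comparing them; that detail is likewise not stated in the abstract.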
