Fugu-MT Paper Translation (Abstract): Representation Alignment for Just Image Transformers is not Easier than You Think

Paper overview: Representation Alignment for Just Image Transformers is not Easier than You Think

arxiv url: http://arxiv.org/abs/2603.14366v1
Date: Sun, 15 Mar 2026 13:08:31 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.773215
Title: Representation Alignment for Just Image Transformers is not Easier than You Think
Title (reference translation): Representation Alignment for Image Transformers Is Not as Easy as You Think
Authors: Jaeyo Shin, Jiwook Kim, Hyunjung Shim
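Abstract summary: Representation Alignment (REPA) has emerged as a simple way to accelerate the training of diffusion transformers in latent space. This paper shows that REPA can fail for Just Image Transformers (JiT). The authors propose PixelREPA, which transforms the alignment target and constrains alignment using a Masked Transformer Adapter.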
Reference score (independently computed attention metric): 25.669017380539064
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformer training in latent space. At the same time, pixel-space diffusion transformers such as Just Image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B$/16$ and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while achieving $> 2\times$ faster convergence. Finally, PixelREPA-H$/16$ achieves FID$=1.81$ and IS$=317.2$. Our code is available at https://github.com/kaist-cvml/PixelREPA.
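Abstract (reference translation): Representation Alignment (REPA) has emerged as a simple way to accelerate the training of diffusion transformers in latent space. At the same time, pixel-space diffusion transformers such as Just Image Transformers (JiT) have been attracting attention because they remove the dependency on a pretrained tokenizer and avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. With REPA, the FID of JiT worsens as training proceeds, and diversity collapses on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace this failure to an information asymmetry: denoising takes place in the high-dimensional image space, whereas the semantic target is strongly compressed, so direct regression becomes a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B$/16$ and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while converging more than $2\times$ faster. Finally, PixelREPA-H$/16$ achieves FID$=1.81$ and IS$=317.2$. Our code is available at https://github.com/kaist-cvml/PixelREPA.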
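
Illustrative sketch (not taken from the paper or its repository): the abstract describes aligning model tokens with features from a pretrained semantic encoder, and constraining that alignment with partial token masking. A common way to write such an alignment term is the negative mean cosine similarity between per-token predictions and per-token targets. The Python below shows only that generic idea. The function names, the way the mask is applied (restricting which tokens enter the loss), and the masking ratio are all assumptions made for illustration; the actual Masked Transformer Adapter may use masking differently.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector aligns with nothing
    return dot / (norm_u * norm_v)


def random_keep_mask(num_tokens, mask_ratio, rng):
    """Hypothetical helper: True keeps a token, False marks it as masked."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i not in masked for i in range(num_tokens)]


def alignment_loss(pred_tokens, target_tokens, keep_mask=None):
    """Negative mean cosine similarity over the tokens selected by keep_mask.

    pred_tokens:   per-token features produced by the model (assumed already
                   projected to the encoder's feature dimension).
    target_tokens: per-token features from a pretrained encoder.
    keep_mask:     optional list of booleans; only True positions contribute.
    """
    if keep_mask is None:
        keep_mask = [True] * len(pred_tokens)
    sims = [
        cosine_similarity(p, t)
        for p, t, keep in zip(pred_tokens, target_tokens, keep_mask)
        if keep
    ]
    if not sims:
        return 0.0  # nothing selected, so no alignment signal
    return -sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy data: four tokens with two-dimensional features.
    pred = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    target = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.0, 5.0]]
    mask = random_keep_mask(len(pred), mask_ratio=0.5, rng=random.Random(0))
    print("full loss:", alignment_loss(pred, target))
    print("masked loss:", alignment_loss(pred, target, mask))
```

The loss reaches its minimum of -1 when every selected token points in the same direction as its target, independent of vector length. That scale invariance is why a cosine objective is a common choice for feature alignment.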

Related papers