Fugu-MT Paper Translation (Abstract): Improved Baselines with Representation Autoencoders

Paper abstract: Improved Baselines with Representation Autoencoders

arxiv url: http://arxiv.org/abs/2605.18324v1
Date: Mon, 18 May 2026 12:42:34 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.609469
Title: Improved Baselines with Representation Autoencoders
Title (reference translation): Improving Baselines with Representation Autoencoders
Authors: Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie
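Abstract summary: Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. Three insights are identified that simplify and improve RAE. RAEv2 converges more than 10x faster than the original RAE.
Reference score (independently computed attention score): 61.47127824064028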
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.
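The first insight in the abstract, defining the representation as the sum of the last k encoder layers, can be illustrated with a minimal pure-Python sketch. The function name `sum_last_k_layers` and the nested-list layout are hypothetical, and the plain unweighted sum is an assumption: the abstract does not say whether the layers are normalized or weighted before summing.

```python
from typing import List, Sequence

Vector = List[float]


def sum_last_k_layers(
    layer_outputs: Sequence[Sequence[Vector]], k: int
) -> List[Vector]:
    """Sum per-token features over the last k encoder layers.

    layer_outputs[l][t] is the feature vector of token t at layer l
    (hypothetical layout). k = 1 recovers the final-layer-only case.
    Assumption: an unweighted sum with no normalization.
    """
    if not 1 <= k <= len(layer_outputs):
        raise ValueError("k must be between 1 and the number of layers")
    selected = layer_outputs[-k:]
    num_tokens = len(selected[0])
    return [
        # zip(*...) groups the same feature dimension across the k layers.
        [sum(values) for values in zip(*(layer[t] for layer in selected))]
        for t in range(num_tokens)
    ]


# Toy example: 3 layers, 1 token, 2 feature dimensions (made-up values).
layers = [[[1.0, 2.0]], [[10.0, 20.0]], [[100.0, 200.0]]]
final_only = sum_last_k_layers(layers, k=1)
last_two = sum_last_k_layers(layers, k=2)
```

Setting k = 1 reduces to using only the final layer, which is the formulation the abstract attributes to the original RAE.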
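Abstract (reference translation): Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate design choices and find three insights that simplify and improve RAE. First, we study a generalized formulation in which the representation is defined as the sum of the last k encoder layers rather than the final layer alone. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we examine the prevalent assumption that RAE (which uses the pretrained representation as the encoder) replaces representation alignment (REPA), which instead distills the same representation into intermediate layers. RAE and REPA exhibit complementary working mechanisms, so the same representation can serve as both the encoder and the target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in the RAE latent space. By re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 converges more than 10x faster than the original RAE, achieving a state-of-the-art gFID of 1.06 in 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 in just 80 epochs, compared with the previous best of 3.26 (800 epochs). This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate the approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.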
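The EP_FID@k metric defined in the abstract (epochs to reach unguided gFID <= k) can be sketched as a small Python helper. The function name `ep_fid_at_k` and the training history below are hypothetical and for illustration only; they are not results from the paper. The sketch also assumes gFID is only measured at discrete evaluation epochs, so the returned value is limited to that granularity.

```python
from typing import Optional, Sequence, Tuple


def ep_fid_at_k(
    history: Sequence[Tuple[int, float]], k: float
) -> Optional[int]:
    """Return the earliest evaluated epoch whose unguided gFID is <= k.

    history holds (epoch, unguided_gfid) pairs in any order.
    Returns None if the threshold is never reached.
    """
    for epoch, gfid in sorted(history):
        if gfid <= k:
            return epoch
    return None


# Made-up evaluation log, not taken from the paper.
example_history = [(10, 5.0), (20, 3.1), (40, 1.9), (80, 1.5)]
epochs_to_fid_2 = ep_fid_at_k(example_history, k=2.0)  # 40 for this toy log
```

A lower EP_FID@k means the threshold is reached sooner; the abstract reports an EP_FID@2 of 35 epochs for RAEv2 against 177 for the original RAE.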

List of related papers