Fugu-MT Paper Translation (Abstract): Diffusion Transformers with Representation Autoencoders

Paper overview: Diffusion Transformers with Representation Autoencoders

arxiv url: http://arxiv.org/abs/2510.11690v1
Date: Mon, 13 Oct 2025 17:51:39 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.494901
Title: Diffusion Transformers with Representation Autoencoders
Title (reference translation): Diffusion Transformers Using Representation Autoencoders
Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
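Abstract summary: Latent generative modeling, in which a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT). Most DiTs rely on the original VAE encoder, which imposes several limitations. This work replaces the VAE with a pretrained representation encoder paired with a trained decoder, forming what the authors call Representation Autoencoders (RAEs).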
Reference score (independently computed attention score): 35.43400861279246
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
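Abstract (reference translation): Latent generative modeling, in which a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT), yet the autoencoder component has hardly evolved. Most DiTs still rely on the original VAE encoder, which brings several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that stem from purely reconstruction-based training and ultimately limit generative quality. This work replaces the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what the authors call Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing a scalable transformer-based architecture. Because these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. The authors analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. The approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, it achieves strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 FID at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.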

Related papers