Fugu-MT paper translation (abstract): A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

Paper summary: A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

arxiv url: http://arxiv.org/abs/2601.21633v1
Date: Thu, 29 Jan 2026 12:32:47 GMT
Status: Translation complete
System update date: 2026-01-30 16:22:49.812387
Title: A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
Authors: Pu Cao, Yiyang Ma, Feng Zhou, Xuedan Yin, Qing Song, Lu Yang
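Abstract summary: In latent diffusion models, the autoencoder is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space. In recent ImageNet-scale AE studies, a systematic bias toward generative metrics is observed in handling this trade-off. We analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion.
Reference score (independently computed attention score): 12.638580946105643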
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
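The abstract's notion of condition drift can be illustrated with a toy sketch: extract a condition from an image, extract it again from the autoencoder's reconstruction, and measure how far the two differ. Everything below is an assumption made for illustration and is not the paper's protocol: `edge_map` is a hypothetical finite-difference condition extractor, `pooled_roundtrip` is a hypothetical stand-in for decode(encode(x)) built from average pooling, and mean absolute difference is an arbitrary choice of drift measure.

```python
import numpy as np


def edge_map(img):
    """Hypothetical condition extractor: finite-difference gradient magnitude."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.hypot(gx, gy)


def pooled_roundtrip(img, k):
    """Hypothetical stand-in for decode(encode(x)).

    Averages each k x k block, then upsamples by repetition.
    k = 1 leaves the image unchanged; larger k discards more detail.
    Assumes both image sides are divisible by k.
    """
    h, w = img.shape
    pooled = img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)


def reconstruction_mse(img, roundtrip):
    """Instance-level reconstruction error of one image."""
    return float(np.mean((img - roundtrip(img)) ** 2))


def condition_drift(img, roundtrip, extract):
    """Mean absolute gap between the condition of x and of its reconstruction."""
    return float(np.mean(np.abs(extract(img) - extract(roundtrip(img)))))


rng = np.random.default_rng(0)
image = rng.random((32, 32))

# Compare reconstruction error and condition drift for three toy "autoencoders".
for k in (1, 2, 4):
    rt = lambda x, k=k: pooled_roundtrip(x, k)
    mse = reconstruction_mse(image, rt)
    drift = condition_drift(image, rt, edge_map)
    print(f"k={k}  mse={mse:.4f}  drift={drift:.4f}")
```

In a real study the round trip would be a trained autoencoder and the extractor would be the condition used for control (for example an edge or depth estimator). The sketch only shows where reconstruction error and condition drift are measured, not how strongly either relates to gFID.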

Related papers