A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
- URL: http://arxiv.org/abs/2601.21633v1
- Date: Thu, 29 Jan 2026 12:32:47 GMT
- Title: A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
- Authors: Pu Cao, Yiyang Ma, Feng Zhou, Xuedan Yin, Qing Song, Lu Yang
- Abstract summary: In latent diffusion models, the autoencoder is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space. In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off. We analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
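The abstract's notion of condition drift can be illustrated with a toy sketch: extract a condition signal from an image and from its autoencoder reconstruction, then measure how far the two diverge, next to an instance-level reconstruction error. Everything below is an assumption made for illustration and is not the paper's evaluation protocol: `toy_autoencoder` is a hypothetical stand-in that average-pools and upsamples, `edge_condition` is a hypothetical gradient-magnitude edge map, and the drift measure is a plain mean absolute difference.

```python
import numpy as np


def toy_autoencoder(image, factor):
    """Hypothetical AE stand-in: average-pool by `factor`, then upsample by repetition."""
    h, w = image.shape
    pooled = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)


def edge_condition(image):
    """Hypothetical condition extractor: gradient-magnitude edge map."""
    gy, gx = np.gradient(image)
    return np.hypot(gx, gy)


def condition_drift(image, reconstruction):
    """Mean absolute difference between the conditions of the image and its reconstruction."""
    diff = edge_condition(image) - edge_condition(reconstruction)
    return float(np.mean(np.abs(diff)))


def reconstruction_mse(image, reconstruction):
    """Instance-level reconstruction error (per-image mean squared error)."""
    return float(np.mean((image - reconstruction) ** 2))


rng = np.random.default_rng(0)
image = rng.random((32, 32))  # synthetic grayscale image, for illustration only

# Map each compression factor to (reconstruction MSE, condition drift).
results = {}
for factor in (1, 2, 4):
    recon = toy_autoencoder(image, factor)
    results[factor] = (reconstruction_mse(image, recon), condition_drift(image, recon))

for factor, (mse, drift) in results.items():
    print(f"factor={factor}  mse={mse:.4f}  drift={drift:.4f}")
```

In this toy setting a lossless round trip (factor 1) leaves the condition untouched, while lossier round trips change the edge map that a downstream controllable model would be asked to match. A real study would substitute trained autoencoders and task-specific condition extractors.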
Related papers
- Knowledge-Embedded and Hypernetwork-Guided Few-Shot Substation Meter Defect Image Generation Method [0.0]
Substation meters play a critical role in monitoring and ensuring the stable operation of power grids. Their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. We propose a novel framework that integrates Conditional Knowledge Embedding and Hypernetwork-Guided Control into a Stable Diffusion pipeline.
arXiv Detail & Related papers (2026-01-14T07:21:57Z)
- Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework [27.71603877164877]
Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing. We propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection.
arXiv Detail & Related papers (2025-12-27T03:16:48Z)
- Uncertainty-Guided Selective Adaptation Enables Cross-Platform Predictive Fluorescence Microscopy [65.15943255667733]
We introduce Subnetwork Image Translation ADDA with automatic depth selection (SIT-ADDA-Auto). We show that adapting only the earliest convolutional layers, while freezing deeper layers, yields reliable transfer. Our results provide a design rule for label-free adaptation in microscopy and a recipe for field settings; the code is publicly available.
arXiv Detail & Related papers (2025-11-15T03:01:05Z)
- Noise & pattern: identity-anchored Tikhonov regularization for robust structural anomaly detection [58.535473924035365]
Anomaly detection plays a pivotal role in automated industrial inspection, aiming to identify subtle or rare defects in otherwise uniform visual patterns. We tackle structural anomaly detection using a self-supervised autoencoder that learns to repair corrupted inputs. We introduce a corruption model that injects artificial disruptions into training images to mimic structural defects.
arXiv Detail & Related papers (2025-11-10T15:48:50Z)
- ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
- Anomaly Detection via Autoencoder Composite Features and NCE [1.2891210250935148]
Autoencoders (AEs) or generative models are often employed to model the data distribution of normal inputs. We propose a decoupled training approach for anomaly detection that uses both an AE and a likelihood model trained with noise contrastive estimation (NCE).
arXiv Detail & Related papers (2025-02-04T01:29:22Z)
- Revisiting Deep Feature Reconstruction for Logical and Structural Industrial Anomaly Detection [2.3020018305241337]
Industrial anomaly detection is crucial for quality control and predictive maintenance.
Existing methods commonly detect structural anomalies, such as dents and scratches, by leveraging multi-scale features from image patches extracted through deep pre-trained networks.
We address these limitations by focusing on Deep Feature Reconstruction (DFR), a memory- and compute-efficient approach for detecting structural anomalies.
We further enhance DFR into a unified framework, called ULSAD, which is capable of detecting both structural and logical anomalies.
arXiv Detail & Related papers (2024-10-21T17:56:47Z)
- Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection [122.4894940892536]
We present a novel self-supervised masked convolutional transformer block (SSMCTB) that comprises the reconstruction-based functionality at a core architectural level.
In this work, we extend our previous self-supervised predictive convolutional attentive block (SSPCAB) with a 3D masked convolutional layer, a transformer for channel-wise attention, as well as a novel self-supervised objective based on Huber loss.
arXiv Detail & Related papers (2022-09-25T04:56:10Z)
- Be Your Own Neighborhood: Detecting Adversarial Example by the Neighborhood Relations Built on Self-Supervised Learning [64.78972193105443]
This paper presents a novel adversarial example (AE) detection framework, named BEYOND, for trustworthy predictions.
BEYOND performs the detection by distinguishing the AE's abnormal relation with its augmented versions.
An off-the-shelf Self-Supervised Learning (SSL) model is used to extract the representation and predict the label.
arXiv Detail & Related papers (2022-08-31T08:18:44Z)
- Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold.
We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples.
We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z)
- On the Robustness of Quality Measures for GANs [136.18799984346248]
This work evaluates the robustness of quality measures of generative models such as Inception Score (IS) and Fréchet Inception Distance (FID).
We show that such metrics can also be manipulated by additive pixel perturbations.
arXiv Detail & Related papers (2022-01-31T06:43:09Z)
- Interpreting Rate-Distortion of Variational Autoencoder and Using Model Uncertainty for Anomaly Detection [5.491655566898372]
We build a scalable machine learning system for unsupervised anomaly detection via representation learning.
We revisit VAE from the perspective of information theory to provide some theoretical foundations on using the reconstruction error.
We show empirically the competitive performance of our approach on benchmark datasets.
arXiv Detail & Related papers (2020-05-05T00:03:48Z)