No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models
- URL: http://arxiv.org/abs/2509.21565v1
- Date: Thu, 25 Sep 2025 20:46:48 GMT
- Title: No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models
- Authors: Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya
- Abstract summary: We propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures.
- Score: 4.511561231517167
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Efficient training strategies for large-scale diffusion models have recently emphasized the importance of improving discriminative feature representations in these models. A central line of work in this direction is representation alignment with features obtained from powerful external encoders, which improves the representation quality as assessed through linear probing. Alignment-based approaches show promise but depend on large pretrained encoders, which are computationally expensive to obtain. In this work, we propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. LSEP eliminates the need for an auxiliary encoder and representation alignment, while incorporating linear probing directly into the network's learning dynamics rather than treating it as a simple post-hoc evaluation tool. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures such as SiTs, achieving an FID of 1.46 on $256 \times 256$ ImageNet dataset.
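The abstract describes adding a linear-probe objective on intermediate features directly to the training loss, with no external encoder. The following is a minimal numpy sketch of that idea, not the authors' implementation: all function names, the zero-initialized probe, and the weighting `lam` are illustrative assumptions.

```python
import numpy as np

# LSEP-style regularizer sketch (hypothetical, not the paper's code):
# a linear probe on intermediate features is trained with cross-entropy,
# and its loss is added to the denoising objective so that linear
# separability shapes the learning dynamics instead of being a
# post-hoc evaluation.

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def linear_probe_loss(features, labels, W, b):
    """Cross-entropy of a linear classifier on intermediate features."""
    logits = features @ W + b
    probs = softmax(logits)
    n = features.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

def lsep_objective(denoise_loss, features, labels, W, b, lam=0.5):
    """Total loss = denoising loss + lam * linear-separability term."""
    return denoise_loss + lam * linear_probe_loss(features, labels, W, b)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))      # stand-in intermediate activations
labels = rng.integers(0, 4, size=8)   # stand-in class labels (4 classes)
W = np.zeros((16, 4))
b = np.zeros(4)
total = lsep_objective(0.1, feats, labels, W, b)
print(round(total, 4))  # 0.7931 = 0.1 + 0.5 * ln(4) with the zero-init probe
```

In an actual training loop both the probe parameters `(W, b)` and the network producing `features` would receive gradients from this combined loss; the sketch only evaluates it once.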
Related papers
- VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training [53.09658039757408]
This paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. Experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers.
arXiv Detail & Related papers (2026-01-25T13:22:38Z)
- Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers [27.14203097630326]
We introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. Experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT.
arXiv Detail & Related papers (2025-11-13T03:40:54Z)
- Learning Diffusion Models with Flexible Representation Guidance [49.26046407886349]
We present a systematic framework for incorporating representation guidance into diffusion models, and introduce two new strategies for enhancing representation alignment. Experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training.
arXiv Detail & Related papers (2025-07-11T19:29:02Z)
- Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution (TDA) methods aim to measure how training data impacts a model's predictions. We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
- Self Distillation via Iterative Constructive Perturbations [0.2748831616311481]
We propose a novel framework that uses a cyclic optimization strategy to concurrently optimize the model and its input data for better training. By alternately adapting the model's parameters to the data and the data to the model, our method effectively addresses the gap between fitting and generalization.
arXiv Detail & Related papers (2025-05-20T13:15:27Z)
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning high-quality internal representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
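The REPA summary describes aligning projected denoiser hidden states with features from a frozen pretrained encoder. A minimal numpy sketch of such an alignment term follows; the function names, the identity projection, and the use of cosine similarity as the alignment measure are illustrative assumptions, not the paper's code.

```python
import numpy as np

# REPA-style alignment sketch (illustrative, not the paper's code):
# project the denoiser's hidden states with a small trainable matrix,
# then maximize per-token cosine similarity to features from a frozen
# pretrained encoder (minimize the negative mean similarity).

def repa_alignment_loss(hidden, target, proj):
    """Negative mean cosine similarity between projected hidden states
    and external encoder features, both shaped [num_tokens, dim]."""
    h = hidden @ proj                                    # lightweight projection
    h = h / np.linalg.norm(h, axis=1, keepdims=True)     # unit-normalize rows
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return -np.mean(np.sum(h * t, axis=1))

rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 8))   # stand-in denoiser hidden states
proj = np.eye(8)                   # identity projection for the demo
loss_aligned = repa_alignment_loss(hidden, hidden, proj)  # perfectly aligned
print(round(loss_aligned, 6))  # -1.0
```

During training this term would be added to the denoising loss with a weighting coefficient, and `proj` would be learned jointly while the external encoder stays frozen.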
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
- Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision.
Model compression detrimentally impacts the performance of visual prompting-based transfer.
However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
- A Simplified Framework for Contrastive Learning for Node Representations [2.277447144331876]
We investigate the potential of deploying contrastive learning in combination with Graph Neural Networks for embedding nodes in a graph.
We show that the quality of the resulting embeddings and training time can be significantly improved by a simple column-wise postprocessing of the embedding matrix.
This modification yields improvements in downstream classification tasks of up to 1.5% and even beats existing state-of-the-art approaches on 6 out of 8 different benchmarks.
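The summary above mentions a simple column-wise postprocessing of the embedding matrix. As a hedged illustration of what such an operation can look like, the sketch below standardizes each embedding dimension to zero mean and unit variance; the paper's exact transform may differ, and `standardize_columns` is a hypothetical name.

```python
import numpy as np

# Sketch of column-wise postprocessing of a node-embedding matrix,
# assuming per-dimension standardization (zero mean, unit variance
# per column). This is one plausible instance of the idea, not
# necessarily the paper's operation.

def standardize_columns(emb, eps=1e-8):
    """Standardize each embedding dimension (column) independently."""
    mu = emb.mean(axis=0, keepdims=True)
    sigma = emb.std(axis=0, keepdims=True)
    return (emb - mu) / (sigma + eps)

rng = np.random.default_rng(2)
emb = rng.normal(loc=3.0, scale=5.0, size=(100, 16))  # raw node embeddings
post = standardize_columns(emb)
print(np.allclose(post.mean(axis=0), 0.0, atol=1e-7))  # True
```

Because the transform acts on columns only, it rescales each feature dimension without mixing information across nodes, which is why it is cheap to apply after training.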
arXiv Detail & Related papers (2023-05-01T02:04:36Z)
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linearly separable representations within its intermediate layers, without auxiliary encoders. Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
arXiv Detail & Related papers (2023-03-17T04:20:47Z)
- Optimising for Interpretability: Convolutional Dynamic Alignment Networks [108.83345790813445]
We introduce a new family of neural network models called Convolutional Dynamic Alignment Networks (CoDA Nets).
Their core building blocks are Dynamic Alignment Units (DAUs), which are optimised to transform their inputs with dynamically computed weight vectors that align with task-relevant patterns.
CoDA Nets model the classification prediction through a series of input-dependent linear transformations, allowing for linear decomposition of the output into individual input contributions.
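The decomposition property described above follows from linearity: if the prediction is an input-dependent linear map, each logit splits exactly into per-input contributions. The toy sketch below illustrates only this arithmetic, not the CoDA-Net architecture; `dynamic_weights` is an illustrative stand-in for a DAU.

```python
import numpy as np

# Toy illustration of input-dependent linear decomposition (not the
# CoDA-Net architecture): with prediction y = W(x) @ x, each output
# logit decomposes exactly into contributions W(x)[k, i] * x[i].

def dynamic_weights(x):
    """Illustrative input-dependent weight matrix, shape [classes, inputs]."""
    return np.outer(np.tanh(x[:3]), x) * 0.1  # 3 toy "classes"

x = np.array([0.5, -1.0, 2.0, 0.3])
W = dynamic_weights(x)
logits = W @ x                     # input-dependent linear prediction
contributions = W * x              # [classes, inputs] contribution map
print(np.allclose(contributions.sum(axis=1), logits))  # True
```

Summing the contribution map over inputs recovers the logits exactly, which is what makes such models decomposable into individual input contributions.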
arXiv Detail & Related papers (2021-09-27T12:39:46Z)
- Sparse aNETT for Solving Inverse Problems with Deep Learning [2.5234156040689237]
We propose a sparse reconstruction framework (aNETT) for solving inverse problems.
We train an autoencoder network $D \circ E$ with $E$ acting as a nonlinear sparsifying transform.
Numerical results are presented for sparse view CT.
arXiv Detail & Related papers (2020-04-20T18:43:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.