Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
- URL: http://arxiv.org/abs/2602.10099v1
- Date: Tue, 10 Feb 2026 18:58:04 GMT
- Title: Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
- Authors: Amandeep Kumar, Vishal M. Patel
- Abstract summary: We show that standard diffusion transformers fail to converge on representations directly. We identify Geometric Interference as the root cause. Our method RJF enables the standard DiT-B architecture to converge effectively, achieving an FID of 3.37.
- Score: 48.68968421120471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging representation encoders for generative modeling offers a path to efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. With RJF, the standard DiT-B architecture (131M parameters) converges effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF
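The "Geometric Interference" claim can be illustrated numerically: a straight Euclidean interpolation between two unit-norm feature vectors dips into the low-norm interior of the hypersphere, while a geodesic (slerp) path keeps unit norm throughout. The sketch below is a minimal illustration of that contrast only; it assumes unit-normalized encoder features and does not reproduce the paper's RJF method or its Jacobi regularization term.

```python
# Minimal sketch: Euclidean vs geodesic (slerp) paths between unit vectors.
# Illustrative only -- NOT the paper's RJF implementation.
import numpy as np

def slerp(x0, x1, t):
    """Spherical linear interpolation between unit vectors x0 and x1."""
    cos_theta = np.clip(np.dot(x0, x1), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * x0 + t * x1
    return (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

rng = np.random.default_rng(0)
d = 768                                                  # a DINO-like feature width (assumption)
x0 = rng.standard_normal(d); x0 /= np.linalg.norm(x0)    # "noise" point on the sphere
x1 = rng.standard_normal(d); x1 /= np.linalg.norm(x1)    # "data" feature on the sphere

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    lin = (1 - t) * x0 + t * x1          # straight Euclidean flow-matching path
    geo = slerp(x0, x1, t)               # geodesic path along the sphere
    print(f"t={t:.2f}  |linear|={np.linalg.norm(lin):.3f}  |slerp|={np.linalg.norm(geo):.3f}")
```

In high dimensions random unit vectors are nearly orthogonal, so the linear path's norm collapses to about 0.707 at t=0.5 (off the feature manifold), while the slerp path stays at exactly 1.0.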
Related papers
- Riemannian Flow Matching for Disentangled Graph Domain Adaptation [51.98961391065951]
Graph Domain Adaptation (GDA) typically uses adversarial learning to align graph embeddings in Euclidean space. DisRFM is a geometry-aware GDA framework that unifies embedding and flow-based transport.
arXiv Detail & Related papers (2026-01-31T11:05:35Z) - DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation [47.409626500688866]
We present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM.
arXiv Detail & Related papers (2026-01-30T12:25:34Z) - HFNO: an interpretable data-driven decomposition strategy for turbulent flows [0.0]
We present a novel FNO-based architecture tailored for reduced-order modeling of turbulent fluid flows. The proposed architecture processes wavenumber bins in parallel, enabling approximation of dispersion relations and non-linear interactions. We evaluate the proposed model on a series of increasingly complex dynamical systems.
arXiv Detail & Related papers (2025-11-03T12:57:19Z) - Rectified-CFG++ for Flow Based Models [26.896426878221718]
We present Rectified-CFG++, an adaptive predictor-corrector guidance method that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets.
arXiv Detail & Related papers (2025-10-09T00:00:47Z) - FLEX: A Backbone for Diffusion-Based Modeling of Spatio-temporal Physical Systems [51.15230303652732]
FLEX (FLow EXpert) is a backbone architecture for generative modeling of spatio-temporal physical systems. It reduces the variance of the velocity field in the diffusion model, which helps stabilize training. It achieves accurate predictions for super-resolution and forecasting tasks using as few as two reverse diffusion steps.
arXiv Detail & Related papers (2025-05-23T00:07:59Z) - PiT: Progressive Diffusion Transformer [50.46345527963736]
Diffusion Transformers (DiTs) achieve remarkable performance in image generation via the transformer architecture. We find that DiTs do not rely as heavily on global information as previously believed. We propose a series of Pseudo Progressive Diffusion Transformers (PiT).
arXiv Detail & Related papers (2025-05-19T15:02:33Z) - TinyFusion: Diffusion Transformers Learned Shallow [52.96232442322824]
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization. We present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2x speedup with an FID score of 2.86.
arXiv Detail & Related papers (2024-12-02T07:05:39Z) - ET-Flow: Equivariant Flow-Matching for Molecular Conformer Generation [3.4146914514730633]
We introduce Equivariant Transformer Flow (ET-Flow) to predict low-energy molecular conformations.
Our approach results in a straightforward and scalable method that operates on all-atom coordinates with minimal assumptions.
ET-Flow significantly increases the precision and physical validity of the generated conformers, while being a lighter model and faster at inference.
arXiv Detail & Related papers (2024-10-29T16:44:10Z) - Convergence Analysis of Flow Matching in Latent Space with Transformers [7.069772598731282]
We present theoretical convergence guarantees for ODE-based generative models, specifically flow matching.
We use a pre-trained autoencoder to map high-dimensional inputs to a low-dimensional latent space, where a transformer is trained to predict the velocity field of the transformation from a standard normal distribution to the target latent distribution (see the minimal flow-matching sketch after this list).
arXiv Detail & Related papers (2024-04-03T07:50:53Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly of their training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous for the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
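As referenced in the "Convergence Analysis of Flow Matching in Latent Space with Transformers" entry above, the latent flow-matching setup is simple to state in code: a frozen encoder produces target latents, and a network is trained to predict the constant velocity of the straight path from standard normal noise to each latent. The sketch below is a toy illustration under those assumptions; the linear "encoder" and MLP "velocity_net" are illustrative stand-ins, not that paper's autoencoder or transformer.

```python
# Minimal sketch of one latent flow-matching training step (toy stand-ins).
import torch
import torch.nn as nn

latent_dim = 32
encoder = nn.Linear(784, latent_dim)          # stand-in for a pretrained autoencoder
for p in encoder.parameters():
    p.requires_grad_(False)                   # encoder stays frozen

velocity_net = nn.Sequential(                 # stand-in for the transformer
    nn.Linear(latent_dim + 1, 256), nn.SiLU(), nn.Linear(256, latent_dim)
)
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)

x = torch.randn(64, 784)                      # a batch of toy "images"
z1 = encoder(x)                               # target latents
z0 = torch.randn_like(z1)                     # source samples: standard normal
t = torch.rand(64, 1)                         # random times in [0, 1]

zt = (1 - t) * z0 + t * z1                    # point on the straight path
target_v = z1 - z0                            # its constant velocity
pred_v = velocity_net(torch.cat([zt, t], dim=1))
loss = ((pred_v - target_v) ** 2).mean()      # conditional flow-matching loss
loss.backward()
opt.step()
print(f"flow-matching loss: {loss.item():.4f}")
```

At sampling time one would integrate the learned velocity field from t=0 to t=1 with an ODE solver and decode the resulting latent; that half of the pipeline is omitted here for brevity.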