MeanFlow Transformers with Representation Autoencoders
- URL: http://arxiv.org/abs/2511.13019v1
- Date: Mon, 17 Nov 2025 06:17:08 GMT
- Title: MeanFlow Transformers with Representation Autoencoders
- Authors: Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon,
- Abstract summary: MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE). We achieve a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at https://github.com/sony/mf-rae.
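The "long jump" the abstract describes can be sketched in a few lines. MeanFlow learns an *average* velocity $u(z_t, r, t)$ over the interval $[r, t]$, so one jump from noise ($t=1$) to data ($r=0$) replaces the many small steps of an ODE solver. The sketch below is illustrative only: `u_net` is a hypothetical stand-in for the trained mean-velocity network, and the oracle uses a straight (rectified-flow) path, where the instantaneous velocity is constant and therefore equals its own average.

```python
import numpy as np

def meanflow_jump(u_net, z_t, r, t):
    """One MeanFlow jump: z_r = z_t - (t - r) * u(z_t, r, t), where u is the
    learned *average* velocity over [r, t], not the instantaneous velocity."""
    return z_t - (t - r) * u_net(z_t, r, t)

# Toy check on a straight path z_t = (1 - t) * x + t * eps: the velocity
# eps - x is constant, so it is also the average velocity, and a single
# jump from t=1 down to r=0 recovers x exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)          # "data" latent
eps = rng.standard_normal(8)        # noise latent at t = 1
oracle_u = lambda z, r, t: eps - x  # oracle average velocity for this path
x_hat = meanflow_jump(oracle_u, eps, r=0.0, t=1.0)
```

In the paper's setting, the same one-jump call would run in the RAE latent space, with the lightweight RAE decoder mapping `x_hat` back to pixels.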
Related papers
- SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows [37.7899995917052]
We find a way to fix the variance (which would otherwise be predicted by the VAE encoder) to a constant. On the ImageNet $256\times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
arXiv Detail & Related papers (2025-12-03T18:59:57Z) - Improved Mean Flows: On the Challenges of Fastforward Generative Models [81.10827083963655]
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. Here, we address key challenges in both the training objective and the guidance mechanism. Our reformulation yields a more standard regression problem and improves training stability. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256.
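For context on the objective these works revise: the original MeanFlow training target follows from differentiating the definition of the average velocity (notation as in the MeanFlow paper; this is a background sketch, not iMF's reformulation):

```latex
% Average velocity over [r, t] along the flow trajectory:
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
% Differentiating (t - r)\, u = \int_r^t v\, d\tau with respect to t
% gives the MeanFlow identity used as a regression target:
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)
```

Here $v$ is the instantaneous (flow matching) velocity and $\frac{d}{dt}$ is the total derivative along the trajectory; the identity lets $u$ be regressed without ever integrating the ODE.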
arXiv Detail & Related papers (2025-12-01T18:59:49Z) - Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model [53.77953728335891]
Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network. We propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion.
arXiv Detail & Related papers (2025-11-18T17:58:16Z) - SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization [56.12853087022071]
We introduce a new pixel diffusion decoder architecture for improved scaling and training stability. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses.
arXiv Detail & Related papers (2025-10-06T15:57:31Z) - CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models [75.81132530657682]
Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training.
arXiv Detail & Related papers (2025-09-29T09:42:08Z) - CSDformer: A Conversion Method for Fully Spike-Driven Transformer [11.852241487470797]
Spike-based transformer is a novel architecture aiming to enhance the performance of spiking neural networks. We propose CSDformer, a novel conversion method for fully spike-driven transformers. CSDformer achieves high performance under ultra-low latency, while dramatically reducing both computational complexity and training overhead.
arXiv Detail & Related papers (2025-09-22T07:55:03Z) - TGLF-SINN: Deep Learning Surrogate Model for Accelerating Turbulent Transport Modeling in Fusion [18.028061388104963]
We propose TGLF-SINN (Spectra-Informed Neural Network) with three key innovations. Our approach achieves superior performance with significantly less training data. In downstream flux-matching applications, our NN surrogate provides a 45x speedup over TGLF while maintaining comparable accuracy.
arXiv Detail & Related papers (2025-09-07T09:36:51Z) - CoVAE: Consistency Training of Variational Autoencoders [9.358185536754537]
We propose a novel single-stage generative autoencoding framework that adopts techniques from consistency models to train a VAE architecture. We show that CoVAE can generate high-quality samples in one or a few steps without the use of a learned prior. Our approach provides a unified framework for autoencoding and diffusion-style generative modeling, and offers a viable route to high-performance one-step generative autoencoding.
arXiv Detail & Related papers (2025-07-12T01:32:08Z) - Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z) - A Principled Hierarchical Deep Learning Approach to Joint Image
Compression and Classification [27.934109301041595]
This work proposes a three-step joint learning strategy to guide encoders to extract features that are compact, discriminative, and amenable to common augmentations/transformations.
Tests show that our proposed method achieves an accuracy improvement of up to 1.5% on CIFAR-10 and 3% on CIFAR-100 over conventional E2E cross-entropy training.
arXiv Detail & Related papers (2023-10-30T15:52:18Z) - Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.