Related papers: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

URL: http://arxiv.org/abs/2403.03206v1
Date: Tue, 5 Mar 2024 18:45:39 GMT
Title: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M\"uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
Abstract summary: Rectified flow is a recent generative model formulation that connects data and noise in a straight line. We improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. We present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities.
Score: 22.11487736315616
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

Related papers

Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting.<n>Our single-step deterministic inference yields up to faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z)
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data [53.040873127309766]
We propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning.<n>Our method outperforms existing models on both in-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities.<n>Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z)
Dataset Distillation with Probabilistic Latent Features [9.318549327568695]
A compact set of synthetic data can effectively replace the original dataset in downstream classification tasks.<n>We propose a novel approach that models the joint distribution of latent features.<n>Our method achieves state-of-the-art cross architecture performance across a range of backbone architectures.
arXiv Detail & Related papers (2025-05-10T13:53:49Z)
Efficient Flow Matching using Latent Variables [9.363347684114474]
We show that $texttLatent-CFM$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models.<n>We also consider generative modeling of spatial fields stemming from physical processes.
arXiv Detail & Related papers (2025-05-07T14:59:23Z)
Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling [25.705179111920806]
This work addresses the question of why and when diffusion models excel at learning high-quality representations in a self-supervised manner. We develop a mathematical framework based on a low-dimensional data model and posterior estimation, revealing a fundamental trade-off between generation and representation quality near the final stage of image generation. Building on these insights, we propose an ensemble method that aggregates features across noise levels, significantly improving both clean performance and robustness under label noise.
arXiv Detail & Related papers (2025-02-09T01:58:28Z)
Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method [60.88467353578118]
We show that a fixed-point-inspired iterative approach to invert real-world images does not achieve convergence, instead oscillating between distinct clusters. We introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing.
arXiv Detail & Related papers (2024-11-17T17:45:37Z)
Time Step Generating: A Universal Synthesized Deepfake Image Detector [0.4488895231267077]
We propose a universal synthetic image detector Time Step Generating (TSG) TSG does not rely on pre-trained models' reconstructing ability, specific datasets, or sampling algorithms. We test the proposed TSG on the large-scale GenImage benchmark and it achieves significant improvements in both accuracy and generalizability.
arXiv Detail & Related papers (2024-11-17T09:39:50Z)
Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models. We propose an algorithm for fast-constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability. We propose dataset style inversion strategy to improve the stylistic alignment between synthetic and real images. Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z)
ReNoise: Real Image Inversion Through Iterative Noising [62.96073631599749]
We introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations. We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models.
arXiv Detail & Related papers (2024-03-21T17:52:08Z)
Stage-by-stage Wavelet Optimization Refinement Diffusion Model for Sparse-View CT Reconstruction [14.037398189132468]
We present an innovative approach named the Stage-by-stage Wavelet Optimization Refinement Diffusion (SWORD) model for sparse-view CT reconstruction. Specifically, we establish a unified mathematical model integrating low-frequency and high-frequency generative models, achieving the solution with optimization procedure. Our method rooted in established optimization theory, comprising three distinct stages, including low-frequency generation, high-frequency refinement and domain transform.
arXiv Detail & Related papers (2023-08-30T10:48:53Z)
Training on Thin Air: Improve Image Classification with Generated Data [28.96941414724037]
Diffusion Inversion is a simple yet effective method to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion. We identify three key components that allow our generated images to successfully supplant the original dataset.
arXiv Detail & Related papers (2023-05-24T16:33:02Z)
Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis. Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
High-Fidelity Synthesis with Disentangled Representation [60.19657080953252]
We propose an Information-Distillation Generative Adrial Network (ID-GAN) for disentanglement learning and high-fidelity synthesis. Our method learns disentangled representation using VAE-based models, and distills the learned representation with an additional nuisance variable to the separate GAN-based generator for high-fidelity synthesis. Despite the simplicity, we show that the proposed method is highly effective, achieving comparable image generation quality to the state-of-the-art methods using the disentangled representation.
arXiv Detail & Related papers (2020-01-13T14:39:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.