OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
- URL: http://arxiv.org/abs/2510.03506v2
- Date: Thu, 09 Oct 2025 04:05:49 GMT
- Title: OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
- Authors: John Nguyen, Marton Havasi, Tariq Berrada, Luke Zettlemoyer, Ricky T. Q. Chen,
- Abstract summary: We present OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. Unlike autoregressive models that enforce rigid causal ordering between text and image generation, OneFlow combines an insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents.
- Score: 59.052955667723985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. Unlike autoregressive models that enforce rigid causal ordering between text and image generation, OneFlow combines an insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents. OneFlow enables concurrent text-image synthesis with hierarchical sampling that prioritizes content over grammar. Through controlled experiments across model sizes from 1B to 8B, we demonstrate that OneFlow outperforms autoregressive baselines on both generation and understanding tasks while using up to 50% fewer training FLOPs. OneFlow surpasses both autoregressive and diffusion-based approaches while unlocking new capabilities for concurrent generation, iterative refinement, and natural reasoning-like generation.
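To make the two training signals described in the abstract concrete, here is a minimal, self-contained sketch of joint training with Flow Matching on continuous image latents and a crude insertion-style objective on text tokens. It is an illustration only, not OneFlow's actual architecture or Edit Flow objective; `JointModel`, the 50% token-drop rate, and the use of token id 0 as a gap placeholder are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    """Toy stand-in for a shared backbone over both modalities (hypothetical)."""
    def __init__(self, image_dim=64, vocab_size=1000, hidden=128):
        super().__init__()
        self.velocity_head = nn.Sequential(
            nn.Linear(image_dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, image_dim))
        self.token_head = nn.Sequential(
            nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

    def forward(self, x_t, t, partial_tokens):
        # Velocity prediction for the noisy image latent at time t.
        v = self.velocity_head(torch.cat([x_t, t[:, None]], dim=-1))
        # Per-position logits over which token should fill each gap.
        insert_logits = self.token_head(partial_tokens)
        return v, insert_logits

def training_step(model, x1, tokens, optimizer):
    """x1: clean image latents (B, D); tokens: text token ids (B, L)."""
    B, L = tokens.shape
    # Flow Matching on image latents: noise -> data interpolation, regress velocity.
    x0 = torch.randn_like(x1)
    t = torch.rand(B)
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1
    target_v = x1 - x0
    # Crude insertion-style text objective: drop half the tokens, predict the dropped ones.
    keep = torch.rand(B, L) < 0.5
    partial = torch.where(keep, tokens, torch.zeros_like(tokens))  # 0 marks a gap
    v_pred, insert_logits = model(x_t, t, partial)
    image_loss = F.mse_loss(v_pred, target_v)
    dropped = ~keep
    if dropped.any():
        text_loss = F.cross_entropy(insert_logits[dropped], tokens[dropped])
    else:
        text_loss = torch.zeros(())
    loss = image_loss + text_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data (shapes are illustrative only).
model = JointModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = training_step(model, torch.randn(8, 64), torch.randint(1, 1000, (8, 64)), opt)
```

The point of the sketch is only that both losses can be computed on the same forward pass over a mixed-modal input; the actual Edit Flow insertion process is a continuous-time objective not reproduced here.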
Related papers
- Trajectory Stitching for Solving Inverse Problems with Flow-Based Models [68.36374645801901]
Flow-based generative models have emerged as powerful priors for solving inverse problems. We propose MS-Flow, which represents the trajectory as a sequence of intermediate latent states rather than a single initial code. We demonstrate the effectiveness of MS-Flow over existing methods on image recovery and inverse problems, including inpainting, super-resolution, and computed tomography.
arXiv Detail & Related papers (2026-02-09T11:36:41Z)
- RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation [12.979642182577157]
Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We introduce RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a tailored noise-injection refinement step. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.
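As a rough illustration of the recipe just described (not RMFlow's actual procedure), the sketch below takes one mean-flow transport step from noise to a coarse sample and then one refinement step after re-injecting a small amount of noise. The average-velocity callable `u_theta`, the time convention (t=0 data, t=1 noise), and the noise level `sigma` are assumptions.

```python
import torch

def one_nfe_sample_with_refinement(u_theta, shape, sigma=0.3, refine=True):
    """Hypothetical 1-NFE mean-flow sampling plus a single noise-injection refinement.
    u_theta(z, r, t) is assumed to return the average velocity over the interval [r, t]."""
    B = shape[0]
    z1 = torch.randn(shape)                      # pure noise at t = 1
    zeros, ones = torch.zeros(B), torch.ones(B)
    # Coarse transport: one jump from t=1 to t=0 using the predicted average velocity.
    x_coarse = z1 - u_theta(z1, zeros, ones)
    if not refine:
        return x_coarse
    # Re-inject noise: move the coarse sample back to an intermediate time t = sigma.
    z_sigma = (1 - sigma) * x_coarse + sigma * torch.randn(shape)
    # One more mean-flow step from t = sigma back to t = 0.
    return z_sigma - sigma * u_theta(z_sigma, zeros, torch.full((B,), sigma))
```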
arXiv Detail & Related papers (2026-01-31T18:27:05Z)
- NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation [66.92488610008519]
NextFlow is a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow activates multimodal understanding and generation capabilities. NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
arXiv Detail & Related papers (2026-01-05T15:27:04Z)
- FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows [17.924626622563924]
FlowBind is an efficient framework for any-to-any generation. It learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods.
arXiv Detail & Related papers (2025-12-17T13:08:18Z)
- AlphaFlow: Understanding and Improving MeanFlow Models [74.64465762009475]
We show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings.
arXiv Detail & Related papers (2025-10-23T17:45:06Z)
- Contrastive Flow Matching [61.60002028726023]
We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer denoising steps, and (3) lowers FID by up to 8.9 compared to training the same models with flow matching.
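A minimal sketch of what such a contrastive term might look like on top of the standard flow-matching regression; the negative-pair construction via batch rolling and the weight `lam` are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_fm_loss(v_theta, x0, x1, cond, lam=0.05):
    """x0: noise (B, D); x1: data (B, D); cond: per-sample conditioning.
    Standard flow-matching loss plus a term that pushes each predicted flow away
    from the target flow of a mismatched sample in the batch."""
    B = x0.shape[0]
    t = torch.rand(B, 1)
    x_t = (1 - t) * x0 + t * x1          # straight-line interpolation noise -> data
    pred = v_theta(x_t, t, cond)         # predicted conditional velocity
    target = x1 - x0                     # flow-matching regression target
    neg_target = torch.roll(target, shifts=1, dims=0)  # a different sample's target
    fm = F.mse_loss(pred, target)
    contrast = F.mse_loss(pred, neg_target)
    return fm - lam * contrast           # minus sign: maximize dissimilarity to negatives
```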
arXiv Detail & Related papers (2025-06-05T17:59:58Z)
- Normalizing Flows are Capable Generative Models [48.31226028595099]
TarFlow is a simple and scalable architecture that enables highly performant NF models. It is straightforward to train end-to-end, and capable of directly modeling and generating pixels. TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin.
arXiv Detail & Related papers (2024-12-09T09:28:06Z)
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows [21.677178476653385]
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis.
arXiv Detail & Related papers (2024-12-02T06:13:01Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improve the sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, with a speedup in computation compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
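For reference, one common classifier-free-guidance construction for flow-based sampling, in the spirit of the Guided Flows entry above, looks roughly like the sketch below; the Euler integrator, the guidance weight `w`, and the callable `v_theta` are assumptions rather than the paper's exact sampler.

```python
import torch

def guided_flow_sample(v_theta, cond, shape, steps=50, w=2.0):
    """Euler integration of a flow model with classifier-free guidance on the velocity.
    v_theta(x, t, cond) returns a velocity field; cond=None means unconditional."""
    x = torch.randn(shape)                      # start from noise at t = 0 (convention assumed)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = v_theta(x, t, cond)            # conditional velocity
        v_uncond = v_theta(x, t, None)          # unconditional velocity
        v = v_uncond + w * (v_cond - v_uncond)  # guided velocity; w = 1 recovers conditional
        x = x + dt * v                          # Euler step toward the data distribution
    return x
```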