Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
- URL: http://arxiv.org/abs/2509.12888v1
- Date: Tue, 16 Sep 2025 09:41:14 GMT
- Title: Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
- Authors: Weiming Chen, Zhihan Zhu, Yijia Wang, Zhihai He
- Abstract summary: We propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. We introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside multimodal diffusion transformers. Our method achieves state-of-the-art performance in terms of fidelity and editability.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
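As a rough illustration of the abstract's first contribution, here is a minimal sketch of high-order (classic RK4) inversion of a rectified-flow ODE dx/dt = v(x, t): integrate backward from the image (t = 1) to the latent (t = 0), then forward again to reconstruct. The scalar `velocity` field and the function names below are toy stand-ins, not the paper's actual solver or learned model; the point is only that a fourth-order round trip is far more faithful than first-order Euler inversion.

```python
# Sketch (assumptions): a rectified-flow model defines an ODE dx/dt = v(x, t);
# inversion integrates it backward from the image (t=1) to noise (t=0).
# `velocity` is a toy scalar field standing in for the learned network
# v_theta(x, t, prompt) used by a real rectified flow model.

def velocity(x, t):
    # Toy linear velocity field, chosen only so the ODE has a smooth solution.
    return -x * (1.0 - t)

def euler_step(x, t, h):
    # First-order (DDIM-style) update, for comparison.
    return x + h * velocity(x, t)

def rk4_step(x, t, h):
    # Classic fourth-order Runge-Kutta update.
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = velocity(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(x, t0, t1, n_steps, step):
    # Integrate the ODE from t0 to t1 with a fixed-step solver.
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = step(x, t, h)
        t += h
    return x

def round_trip(x1, n_steps, step):
    # Invert image -> latent (t: 1 -> 0), then re-sample latent -> image.
    x0 = integrate(x1, 1.0, 0.0, n_steps, step)
    return integrate(x0, 0.0, 1.0, n_steps, step)

x1 = 0.8  # stand-in "image" value
err_euler = abs(round_trip(x1, 10, euler_step) - x1)
err_rk4 = abs(round_trip(x1, 10, rk4_step) - x1)
print(err_euler, err_rk4)  # RK4 round-trip error is much smaller than Euler's
```

With the same 10 steps, the first-order round trip drifts visibly while the RK4 round trip is near-exact, which is the accuracy/consistency argument the abstract makes for high-order inversion.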
Related papers
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Its single-step deterministic inference yields faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z)
- Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation
Current deep dehazing methods only focus on removing haze from hazy images, lacking the capability to translate between hazy and haze-free images. We propose a residual-based efficient bidirectional diffusion model (RBDM) that can model the conditional distributions for both dehazing and haze generation. Our RBDM successfully implements size-agnostic bidirectional transitions between haze-free and hazy images with only 15 sampling steps.
arXiv Detail & Related papers (2025-08-15T01:00:15Z)
- Single-Step Latent Consistency Model for Remote Sensing Image Super-Resolution
We propose a novel single-step diffusion approach designed to enhance both efficiency and visual quality in RSISR tasks. The proposed LCMSR reduces the iterative steps of traditional diffusion models from 50-1000 or more to just a single step. Experimental results demonstrate that LCMSR effectively balances efficiency and performance, achieving inference times comparable to non-diffusion models.
arXiv Detail & Related papers (2025-03-25T09:56:21Z)
- EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. This work introduces an efficient diffusion transformer (EDiT) to alleviate efficiency bottlenecks in conventional DiTs and Multimodal DiTs.
arXiv Detail & Related papers (2025-03-20T21:58:45Z)
- One-Step Diffusion Model for Image Motion-Deblurring
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full- and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
- One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation
FluxSR is a novel one-step diffusion Real-ISR method based on flow matching models. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR model. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z) - Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations [41.87051958934507]
This paper addresses two key tasks: (i) inversion and (ii) editing of a real image using rectified flow models (such as Flux).
Our inversion method allows for state-of-the-art performance in zero-shot inversion and editing, outperforming prior works in stroke-to-image synthesis and semantic image editing.
arXiv Detail & Related papers (2024-10-14T17:56:24Z) - Effective Diffusion Transformer Architecture for Image Super-Resolution [63.254644431016345]
We design an effective diffusion transformer for image super-resolution (DiT-SR). In practice, DiT-SR leverages an overall U-shaped architecture and adopts a uniform isotropic design for all transformer blocks. We analyze the limitations of the widely used AdaLN and present a frequency-adaptive time-step conditioning module.
arXiv Detail & Related papers (2024-09-29T07:14:16Z) - FlowIE: Efficient Image Enhancement via Rectified Flow [71.6345505427213]
FlowIE is a flow-based framework that estimates straight-line paths from an elementary distribution to high-quality images.
Our contributions are rigorously validated through comprehensive experiments on synthetic and real-world datasets.
arXiv Detail & Related papers (2024-06-01T17:29:29Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.