Prompt-Guided Dual Latent Steering for Inversion Problems
- URL: http://arxiv.org/abs/2509.18619v1
- Date: Tue, 23 Sep 2025 04:11:06 GMT
- Title: Prompt-Guided Dual Latent Steering for Inversion Problems
- Authors: Yichen Wu, Xu Liu, Chenxuan Zhao, Xinyu Wu
- Abstract summary: Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy. We introduce Prompt-Guided Dual Latent Steering (PDLS), a novel framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt.
- Score: 16.58915166460579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy, leading to reconstructions with semantic drift, such as blurred details or incorrect attributes. To overcome this, we introduce Prompt-Guided Dual Latent Steering (PDLS), a novel, training-free framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt. We formulate this dual guidance as an optimal control problem and derive a closed-form solution via a Linear Quadratic Regulator (LQR). This controller dynamically steers the generative trajectory at each step, preventing semantic drift while ensuring the preservation of fine detail without costly, per-image optimization. Extensive experiments on FFHQ-1K and ImageNet-1K under various inversion tasks, including Gaussian deblurring, motion deblurring, super-resolution and freeform inpainting, demonstrate that PDLS produces reconstructions that are both more faithful to the original image and better aligned with the semantic information than single-latent baselines.
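The abstract describes steering a trajectory with a closed-form LQR controller that balances a structural target against a prompt-guided semantic target. The following is only a toy sketch of that general idea, not the paper's formulation: it assumes simple integrator dynamics `x[k+1] = x[k] + dt * u[k]`, a quadratic cost with scalar weights, and hypothetical names (`lqr_gains`, `steer`, `w_struct`). It relies on the fact that a weighted sum of two quadratic tracking costs equals, up to a constant, a single quadratic cost around the weighted mean of the two targets.

```python
import numpy as np


def lqr_gains(n_steps, dt, q, r, q_final):
    """Backward Riccati recursion for the scalar error dynamics
    e[k+1] = e[k] + dt * u[k], with stage cost q*e^2 + r*u^2
    and terminal cost q_final*e^2. Returns one feedback gain per step."""
    p = q_final
    gains = [0.0] * n_steps
    for k in reversed(range(n_steps)):
        gain = dt * p / (r + dt * dt * p)  # K = (R + B'PB)^-1 B'PA with A=1, B=dt
        gains[k] = gain
        p = q + p - p * dt * gain          # Riccati update for A=1, B=dt
    return gains


def steer(x0, x_struct, x_sem, w_struct=0.5, n_steps=20,
          q=1.0, r=0.1, q_final=10.0):
    """Steer a latent x0 toward a blend of two targets (toy assumption).

    x_struct and x_sem stand in for 'structural' and 'semantic' latents;
    the blended target is their weighted mean."""
    dt = 1.0 / n_steps
    target = w_struct * x_struct + (1.0 - w_struct) * x_sem
    x = np.array(x0, dtype=float)
    for gain in lqr_gains(n_steps, dt, q, r, q_final):
        u = -gain * (x - target)  # linear state feedback on the tracking error
        x = x + dt * u            # integrator dynamics (assumed, not from the paper)
    return x, target


if __name__ == "__main__":
    x0 = np.zeros(4)
    x_struct = np.ones(4)
    x_sem = np.array([2.0, 0.0, 1.0, 3.0])
    x_final, target = steer(x0, x_struct, x_sem)
    print("final error norm:", np.linalg.norm(x_final - target))
```

Because the dynamics and cost weights are scalar here, the same gain applies to every latent coordinate; a real rectified-flow model would add the learned velocity field as a drift term, which this sketch omits.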
Related papers
- Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration [39.5698877093219]
We propose an Uncertainty-Aware Diffusion Bridge Model (UDBM) for image restoration. UDBM reformulates AiOIR as a transport problem steered by pixel-wise uncertainty. It achieves state-of-the-art performance across diverse restoration tasks within a single inference step.
arXiv Detail & Related papers (2026-01-29T12:02:42Z) - Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing [21.585366155855894]
We propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. We introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers. Our method achieves state-of-the-art performance in terms of fidelity and editability.
arXiv Detail & Related papers (2025-09-16T09:41:14Z) - Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion [15.384896404310645]
We propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. Our method produces high-quality, semantically coherent, and structurally consistent image generations.
arXiv Detail & Related papers (2025-08-13T07:46:00Z) - Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing [21.95149344518237]
Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. We introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. Preliminary results on a retinal vessel dataset demonstrate that our method reduces registration error to 1.88 pixels on 2912x2 images.
arXiv Detail & Related papers (2025-06-12T15:26:03Z) - DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing [73.12011187146481]
Inversion within diffusion models aims to recover the latent noise representation for a real or generated image. Most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. We introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image.
arXiv Detail & Related papers (2025-06-03T07:46:44Z) - FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing [47.908940130654535]
FlowAlign is an inversion-free flow-based framework for consistent image editing with optimal control-based trajectory control. Our terminal point regularization is shown to balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. FlowAlign outperforms existing methods in both source preservation and editing controllability.
arXiv Detail & Related papers (2025-05-29T06:33:16Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose Self-supervised Transfer (PST) and a Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Training-Free Layout-to-Image Generation with Marginal Attention Constraints [73.55660250459132]
We propose a training-free layout-to-image (L2I) approach, which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions. We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z) - Bidirectional Consistency Models [1.486435467709869]
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector. DMs can also invert an input image to noise by moving backward along the probability flow ordinary differential equation (PF ODE).
arXiv Detail & Related papers (2024-03-26T18:40:36Z) - DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior [70.46245698746874]
We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks.
DiffBIR decouples the blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content.
In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results.
For the second stage, we propose IRControlNet that leverages the generative ability of latent diffusion models to generate realistic details.
arXiv Detail & Related papers (2023-08-29T07:11:52Z) - Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration [67.23451452670282]
Misalignments between multi-modality images pose challenges in image fusion.
We propose a Cross-modality Multi-scale Progressive Dense Registration scheme.
This scheme accomplishes the coarse-to-fine registration exclusively using a one-stage optimization.
arXiv Detail & Related papers (2023-08-22T03:46:24Z) - Overparameterization Improves StyleGAN Inversion [66.8300251627992]
Existing inversion approaches obtain promising yet imperfect results.
We show that this allows us to obtain near-perfect image reconstruction without the need for encoders.
Our approach also retains editability, which we demonstrate by realistically interpolating between images.
arXiv Detail & Related papers (2022-05-12T18:42:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.