Related papers: Prompt-Guided Dual Latent Steering for Inversion Problems

URL: http://arxiv.org/abs/2509.18619v1
Date: Tue, 23 Sep 2025 04:11:06 GMT
Title: Prompt-Guided Dual Latent Steering for Inversion Problems
Authors: Yichen Wu, Xu Liu, Chenxuan Zhao, Xinyu Wu,
Abstract summary: Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy. We introduce Prompt-Guided Dual Latent Steering (PDLS), a novel framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt.
Score: 16.58915166460579
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy, leading to reconstructions with semantic drift, such as blurred details or incorrect attributes. To overcome this, we introduce Prompt-Guided Dual Latent Steering (PDLS), a novel, training-free framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt. We formulate this dual guidance as an optimal control problem and derive a closed-form solution via a Linear Quadratic Regulator (LQR). This controller dynamically steers the generative trajectory at each step, preventing semantic drift while ensuring the preservation of fine detail without costly, per-image optimization. Extensive experiments on FFHQ-1K and ImageNet-1K under various inversion tasks, including Gaussian deblurring, motion deblurring, super-resolution and freeform inpainting, demonstrate that PDLS produces reconstructions that are both more faithful to the original image and better aligned with the semantic information than single-latent baselines.
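The abstract does not give the controller itself, so what follows is only a minimal sketch, under stated assumptions, of what a closed-form, per-step "dual guidance" rule can look like. It uses the simplest instance of the LQR idea: a horizon-one quadratic tracking cost on a single Euler step of a rectified-flow ODE, penalising distance to a structural target, distance to a semantic target, and control effort. The function names (`step_cost`, `dual_lqr_control`, `steered_euler_step`, `toy_velocity`), the weights `w_struct`, `w_sem`, `lam`, the fixed placeholder targets, and the toy velocity field are all hypothetical; the actual PDLS formulation is likely richer (multi-step or continuous-time) and may differ substantially.

```python
# Hypothetical sketch, not the PDLS derivation: a horizon-one linear-quadratic
# tracking controller that steers one Euler step of a rectified-flow ODE
# towards a structural target and a semantic target at the same time.
from typing import Callable, List

Vec = List[float]
VelocityFn = Callable[[Vec, float], Vec]


def step_cost(x_next: Vec, u: Vec, y_struct: Vec, y_sem: Vec,
              w_struct: float, w_sem: float, lam: float) -> float:
    """Structural tracking + semantic tracking + control-effort penalty."""
    c_struct = sum((xi - yi) ** 2 for xi, yi in zip(x_next, y_struct))
    c_sem = sum((xi - yi) ** 2 for xi, yi in zip(x_next, y_sem))
    c_u = sum(ui ** 2 for ui in u)
    return w_struct * c_struct + w_sem * c_sem + lam * c_u


def dual_lqr_control(a: Vec, y_struct: Vec, y_sem: Vec, h: float,
                     w_struct: float, w_sem: float, lam: float) -> Vec:
    """Closed-form minimiser of step_cost(a + h*u, u, ...) over u.

    Setting the gradient to zero gives, per coordinate,
        u = h * (w_struct*(y_struct - a) + w_sem*(y_sem - a))
            / (lam + h**2 * (w_struct + w_sem)).
    """
    denom = lam + h * h * (w_struct + w_sem)
    if denom <= 0.0:
        raise ValueError("need lam > 0 or a positive tracking weight")
    return [h * (w_struct * (ys - ai) + w_sem * (yp - ai)) / denom
            for ai, ys, yp in zip(a, y_struct, y_sem)]


def steered_euler_step(x: Vec, t: float, h: float, velocity: VelocityFn,
                       y_struct: Vec, y_sem: Vec, w_struct: float = 1.0,
                       w_sem: float = 1.0, lam: float = 0.1) -> Vec:
    """One controlled step: x_next = x + h * (v(x, t) + u)."""
    v = velocity(x, t)
    a = [xi + h * vi for xi, vi in zip(x, v)]  # uncontrolled Euler prediction
    u = dual_lqr_control(a, y_struct, y_sem, h, w_struct, w_sem, lam)
    return [ai + h * ui for ai, ui in zip(a, u)]


def toy_velocity(x: Vec, t: float) -> Vec:
    """Stand-in for a trained rectified-flow velocity network."""
    return [-xi for xi in x]


if __name__ == "__main__":
    x = [1.0, -0.5]
    n_steps = 10
    h = 1.0 / n_steps
    for k in range(n_steps):
        # Placeholder targets: in a real pipeline these would come from the
        # source-image (structural) path and the prompt-guided (semantic) path.
        y_struct = [0.8, -0.4]
        y_sem = [0.6, 0.0]
        x = steered_euler_step(x, k * h, h, toy_velocity, y_struct, y_sem)
    print(x)
```

In this sketch, raising `lam` makes the controller intervene less, while shifting weight between `w_struct` and `w_sem` trades fidelity to the source against alignment with the prompt, which is the balance the abstract describes. Because the cost is a strictly convex quadratic in `u`, the minimiser is available without any per-image optimisation loop.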

Related papers

Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration [39.5698877093219]
We propose an Uncertainty-Aware Diffusion Bridge Model (UDBM) for image restoration. UDBM reformulates all-in-one image restoration (AiOIR) as a transport problem steered by pixel-wise uncertainty. It achieves state-of-the-art performance across diverse restoration tasks within a single inference step.
arXiv Detail & Related papers (2026-01-29T12:02:42Z)
Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing [21.585366155855894]
We propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. We introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers. Our method achieves state-of-the-art performance in terms of fidelity and editability.
arXiv Detail & Related papers (2025-09-16T09:41:14Z)
Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion [15.384896404310645]
We propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. Our method produces high-quality, semantically coherent, and structurally consistent image generations.
arXiv Detail & Related papers (2025-08-13T07:46:00Z)
Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing [21.95149344518237]
Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. We introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. Preliminary results on a retinal vessel dataset demonstrate our method reduces registration error to 1.88 pixels on 2912x2 images.
arXiv Detail & Related papers (2025-06-12T15:26:03Z)
DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing [73.12011187146481]
Inversion in diffusion models aims to recover the latent noise representation for a real or generated image. Most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. We introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image.
arXiv Detail & Related papers (2025-06-03T07:46:44Z)
FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing [47.908940130654535]
FlowAlign is an inversion-free flow-based framework for consistent image editing with optimal control-based trajectory control. Our terminal point regularization is shown to balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. FlowAlign outperforms existing methods in both source preservation and editing controllability.
arXiv Detail & Related papers (2025-05-29T06:33:16Z)
Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose Self-supervised Transfer (PST) and a Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
Training-Free Layout-to-Image Generation with Marginal Attention Constraints [73.55660250459132]
We propose a training-free layout-to-image (L2I) approach, which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions. We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
Bidirectional Consistency Models [1.486435467709869]
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector. DMs can also invert an input image to noise by moving backward along the probability flow ordinary differential equation (PF ODE).
arXiv Detail & Related papers (2024-03-26T18:40:36Z)
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior [70.46245698746874]
We present DiffBIR, a general restoration pipeline that can handle different blind image restoration tasks. DiffBIR decouples the blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet, which leverages the generative ability of latent diffusion models to generate realistic details.
arXiv Detail & Related papers (2023-08-29T07:11:52Z)
Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration [67.23451452670282]
Misalignments between multi-modality images pose challenges in image fusion. We propose a Cross-modality Multi-scale Progressive Dense Registration scheme. This scheme accomplishes the coarse-to-fine registration exclusively using a one-stage optimization.
arXiv Detail & Related papers (2023-08-22T03:46:24Z)
Overparameterization Improves StyleGAN Inversion [66.8300251627992]
Existing inversion approaches obtain promising yet imperfect results. We show that overparameterizing the latent space allows us to obtain near-perfect image reconstruction without the need for encoders. Our approach also retains editability, which we demonstrate by realistically interpolating between images.
arXiv Detail & Related papers (2022-05-12T18:42:43Z)
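Several of the summaries above concern running an image along an ODE in reverse: the Runge-Kutta paper proposes a high-order solver for rectified flow inversion, and the Bidirectional Consistency Models summary describes inverting an image by moving backward along the PF ODE. As a rough illustration of why solver order matters for inversion, here is a sketch that inverts a toy "image" from data to noise and then regenerates it, comparing first-order Euler with Heun's method (a second-order Runge-Kutta scheme). The velocity field `toy_velocity`, the time convention (t = 0 is data, t = 1 is noise), and the helper names are assumptions for illustration; neither paper's actual solver or model is reproduced here.

```python
# Hypothetical sketch: round-trip inversion of a toy ODE dx/dt = v(x, t),
# comparing Euler with Heun (second-order Runge-Kutta). Not a trained model.
from typing import Callable, List

Vec = List[float]
VelocityFn = Callable[[Vec, float], Vec]
StepFn = Callable[[Vec, float, float, VelocityFn], Vec]


def euler_step(x: Vec, t: float, h: float, v: VelocityFn) -> Vec:
    """First-order step: x + h * v(x, t)."""
    return [xi + h * vi for xi, vi in zip(x, v(x, t))]


def heun_step(x: Vec, t: float, h: float, v: VelocityFn) -> Vec:
    """Second-order Runge-Kutta (Heun): average slopes at both ends."""
    k1 = v(x, t)
    x_pred = [xi + h * ki for xi, ki in zip(x, k1)]
    k2 = v(x_pred, t + h)
    return [xi + 0.5 * h * (a + b) for xi, a, b in zip(x, k1, k2)]


def integrate(x: Vec, t0: float, t1: float, n_steps: int,
              v: VelocityFn, step: StepFn) -> Vec:
    h = (t1 - t0) / n_steps  # negative when integrating backward in time
    t = t0
    for _ in range(n_steps):
        x = step(x, t, h, v)
        t += h
    return x


def round_trip_error(x0: Vec, v: VelocityFn, n_steps: int,
                     step: StepFn) -> float:
    """Invert data -> noise (t: 0 -> 1), regenerate (t: 1 -> 0), measure drift."""
    z = integrate(x0, 0.0, 1.0, n_steps, v, step)
    x_rec = integrate(z, 1.0, 0.0, n_steps, v, step)
    return max(abs(a - b) for a, b in zip(x0, x_rec))


def toy_velocity(x: Vec, t: float) -> Vec:
    """Stand-in velocity field with time dependence."""
    return [-(1.0 + t) * xi for xi in x]


if __name__ == "__main__":
    x0 = [1.0, -2.0]
    for name, step in [("Euler", euler_step), ("Heun (RK2)", heun_step)]:
        print(name, round_trip_error(x0, toy_velocity, 10, step))
```

On this toy problem the Euler round trip drifts noticeably, while the Heun round trip stays much closer to the input at the same step count, at the cost of two velocity evaluations per step. How large the gap is for a real rectified-flow model depends on the model and the step schedule.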

This list is automatically generated from the titles and abstracts of the papers on this site.