Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
- URL: http://arxiv.org/abs/2511.18673v1
- Date: Mon, 24 Nov 2025 01:13:51 GMT
- Title: Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
- Authors: Yiqing Shi, Yiren Song, Mike Zheng Shou
- Abstract summary: Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Our single-step deterministic inference yields up to faster runtime while training on relatively small datasets.
- Score: 55.15722080205737
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
Related papers
- Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing [62.94394079771687]
A burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. We propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We show that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both Text-to-Image (T2I) and image editing tasks.
arXiv Detail & Related papers (2025-12-19T18:59:57Z) - OmniRefiner: Reinforcement-Guided Local Diffusion Refinement [10.329465965964571]
VAE-based latent compression discards subtle texture information, causing identity- and attribute-specific cues to vanish. We introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction. Experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation.
arXiv Detail & Related papers (2025-11-25T06:57:49Z) - ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion [7.233066974580282]
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. We propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training.
arXiv Detail & Related papers (2025-10-29T17:17:32Z) - EditInfinity: Image Editing with Binary-Quantized Generative Models [64.05135380710749]
We investigate the parameter-efficient adaptation of binary-quantized generative models for image editing. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation.
arXiv Detail & Related papers (2025-10-23T05:06:24Z) - InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models [10.213390634031049]
Current image inpainting localization (IIL) methods face two main challenges: a tendency towards overconfidence and difficulty in detecting subtle tampering boundaries. We propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions.
arXiv Detail & Related papers (2025-01-06T07:32:12Z) - Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method [60.88467353578118]
We show that a fixed-point-inspired iterative approach to invert real-world images does not achieve convergence, instead oscillating between distinct clusters.
We introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing.
arXiv Detail & Related papers (2024-11-17T17:45:37Z) - Scaling Rectified Flow Transformers for High-Resolution Image Synthesis [22.11487736315616]
Rectified flow is a recent generative model formulation that connects data and noise in a straight line.
We improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales.
We present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities.
arXiv Detail & Related papers (2024-03-05T18:45:39Z) - Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection [106.39544368711427]
We study the problem of generalizable synthetic image detection, aiming to detect forgery images from diverse generative methods.
We present a novel forgery-aware adaptive transformer approach, namely FatFormer.
Our approach tuned on 4-class ProGAN data attains an average of 98% accuracy to unseen GANs, and surprisingly generalizes to unseen diffusion models with 95% accuracy.
arXiv Detail & Related papers (2023-12-27T17:36:32Z) - Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis [15.76266032768078]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We first introduce vision guidance as a foundational spatial cue within the perturbed distribution. We propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers.
arXiv Detail & Related papers (2023-11-30T10:36:19Z) - Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z) - EDICT: Exact Diffusion Inversion via Coupled Transformations [13.996171129586731]
Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem.
We propose Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers.
EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors.
arXiv Detail & Related papers (2022-11-22T18:02:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.