OmniRefiner: Reinforcement-Guided Local Diffusion Refinement
- URL: http://arxiv.org/abs/2511.19990v1
- Date: Tue, 25 Nov 2025 06:57:49 GMT
- Title: OmniRefiner: Reinforcement-Guided Local Diffusion Refinement
- Authors: Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song,
- Abstract summary: VAE-based latent compression discards subtle texture information, causing identity- and attribute-specific cues to vanish. We introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction. Experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation.
- Score: 10.329465965964571
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches built on existing methods to amplify local details often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.
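The abstract describes the two-stage pipeline but gives no implementation details. As a minimal sketch of how the stages could compose, the following assumes entirely hypothetical interfaces (editor.edit, editor.sample_with_log_prob, and both reward functions are assumptions, not the authors' API):

```python
# Hypothetical sketch of the two-stage refinement described in the abstract.
# All interfaces here are assumptions; the paper does not publish this API.

def stage1_refine(editor, draft, reference, prompt):
    """Stage 1: a diffusion editor fine-tuned to jointly ingest the draft
    and the reference image, producing a globally coherent refinement."""
    return editor.edit(images=[draft, reference], prompt=prompt)

def stage2_rl_step(editor, draft, reference, prompt,
                   detail_reward, semantic_reward, optimizer):
    """Stage 2: a REINFORCE-style update that rewards detail accuracy
    and semantic consistency of localized edits."""
    refined, log_prob = editor.sample_with_log_prob(
        images=[draft, reference], prompt=prompt)
    reward = detail_reward(refined, reference) + semantic_reward(refined, prompt)
    loss = -(reward.detach() * log_prob).mean()  # maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return refined, reward
```

The paper's actual RL objective may differ; this only illustrates the stage ordering (global reference-conditioned refinement, then reward-driven local correction).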
Related papers
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment [105.31858867473845]
ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round, local editing. In experiments, ImageCritic effectively resolves detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
arXiv Detail & Related papers (2025-11-25T18:40:25Z)
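No code accompanies this listing; the following is a minimal sketch of the multi-round detect-and-correct loop the summary describes, with hypothetical critic and editor interfaces:

```python
# Hypothetical sketch of a multi-round, local detect-and-correct loop in the
# spirit of the ImageCritic summary; critic/editor interfaces are assumptions.

def refine_until_consistent(critic, editor, image, reference, max_rounds=3):
    for _ in range(max_rounds):
        # Assumed critic API: returns masks for regions judged inconsistent
        # with the reference; an empty list means the image passes.
        regions = critic.detect_inconsistencies(image, reference)
        if not regions:
            break
        for mask in regions:
            # Local editing restricted to each flagged region.
            image = editor.edit_region(image, reference, mask=mask)
    return image
```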
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Our single-step deterministic inference yields faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z)
- EditInfinity: Image Editing with Binary-Quantized Generative Models [64.05135380710749]
We investigate the parameter-efficient adaptation of binary-quantized generative models for image editing. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We further propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation.
arXiv Detail & Related papers (2025-10-23T05:06:24Z)
- Local-Global Context-Aware and Structure-Preserving Image Super-Resolution [23.87231269881077]
Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content. We propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships.
arXiv Detail & Related papers (2025-10-11T07:17:31Z)
- Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models [3.3845637570565814]
We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. Cross-attention with reference latents facilitates semantic alignment between the source and reference.
arXiv Detail & Related papers (2025-04-22T09:18:16Z)
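As a rough illustration of the recipe this summary describes (inversion, per-timestep null-text embeddings, and cross-attention to reference latents), here is a hedged sketch; every method on model is an assumed interface, not the paper's actual API:

```python
# Hypothetical sketch of inversion-based zero-shot editing with
# timestep-specific null-text embeddings; all model methods are assumptions.

def edit_zero_shot(model, source, reference, prompt, num_steps=50):
    # DDIM-style inversion recovers the source image's latent trajectory.
    trajectory = model.invert(source, num_steps=num_steps)
    # One null-text embedding per timestep helps preserve source structure.
    null_embeds = [model.null_embedding(t) for t in range(num_steps)]
    ref_latents = model.encode(reference)  # reference features for attention
    z = trajectory[-1]  # start from the fully inverted latent
    for t in reversed(range(num_steps)):
        # Cross-attention against the reference latents aligns semantics,
        # while the timestep's null-text embedding anchors source structure.
        z = model.denoise_step(z, t, prompt,
                               null_embed=null_embeds[t],
                               cross_attn_context=ref_latents)
    return model.decode(z)
```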
- From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion [98.31811240195324]
ConFill is a novel framework that reduces discrepancies between generated and original images at each diffusion step. It outperforms current methods, setting a new benchmark in image completion.
arXiv Detail & Related papers (2025-04-19T13:40:46Z)
- Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing [66.48853049746123]
We analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios.
arXiv Detail & Related papers (2024-11-29T12:11:28Z)
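The core idea here, replacing the learned cross-attention map with a uniform one, is simple enough to sketch directly. Assuming standard (batch, tokens, dim) tensors, a uniform attention map reduces each query's output to the mean of the value vectors:

```python
import torch

def uniform_cross_attention(q, k, v):
    """Replace softmax(q @ k^T / sqrt(d)) with a uniform map: every query
    attends equally to all keys, so text-condition variation no longer
    shifts the attention pattern during noise prediction."""
    batch, n_q, _ = q.shape
    n_k = k.shape[1]
    attn = torch.full((batch, n_q, n_k), 1.0 / n_k,
                      device=v.device, dtype=v.dtype)
    return attn @ v  # each output row is the mean of the value vectors
```

Because q and k no longer influence the weights, the output depends only on the values, which is what removes the text-condition-dependent distortion.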
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention".
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
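To make the CoSeR summary concrete: "marrying image appearance and language understanding" could amount to fusing image and text features into a single conditioning vector. The following is a minimal, hypothetical sketch; the dimensions and the fusion layer are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CognitiveEmbedding(nn.Module):
    """Hypothetical fusion of appearance (image) and semantic (language)
    features into one conditioning embedding for a super-resolution model."""

    def __init__(self, img_dim=768, txt_dim=768, out_dim=1024):
        super().__init__()
        self.fuse = nn.Linear(img_dim + txt_dim, out_dim)

    def forward(self, image_feats, text_feats):
        # Concatenate the two modalities, then project to the conditioning size.
        return self.fuse(torch.cat([image_feats, text_feats], dim=-1))
```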
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.