SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning
- URL: http://arxiv.org/abs/2510.22534v1
- Date: Sun, 26 Oct 2025 05:03:55 GMT
- Title: SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning
- Authors: Chen Chen, Majid Abdolshah, Violetta Shevchenko, Hongdong Li, Chang Xu, Pulak Purkait
- Abstract summary: We propose a spatially re-focused super-resolution framework with two components. First, Spatially Re-focused Cross-Attention (SRCA) refines text conditioning at inference time using visually-grounded segmentation masks. Second, a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism selectively bypasses text influence on ungrounded pixels to prevent hallucinations.
- Score: 59.013863248600046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency for cross-attention to divert towards irrelevant pixels. These limitations can lead to semantic misalignment and hallucinated details in the generated high-resolution outputs. To address these, we propose a novel, plug-and-play spatially re-focused super-resolution (SRSR) framework that consists of two core components: first, we introduce Spatially Re-focused Cross-Attention (SRCA), which refines text conditioning at inference time by applying visually-grounded segmentation masks to guide cross-attention. Second, we introduce a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations. Extensive experiments on both synthetic and real-world datasets demonstrate that SRSR consistently outperforms seven state-of-the-art baselines in standard fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on two real-world benchmarks, underscoring its effectiveness in achieving both high semantic fidelity and perceptual quality in super-resolution.
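The two mechanisms described in the abstract can be sketched in a few lines of NumPy. This is an illustrative reading of the paper's idea, not its implementation: the tensor shapes, the binary mask convention, the `-1e9` masking constant, and the guidance `scale` are all assumptions made for the example.

```python
import numpy as np

def srca_masked_attention(q, k, v, token_masks):
    """Spatially Re-focused Cross-Attention (illustrative sketch).

    q: (P, d) pixel queries; k, v: (T, d) text-token keys/values.
    token_masks: (T, P) binary grounding masks (1 = token grounded at pixel).
    Attention from a pixel to a text token is suppressed wherever the
    token's segmentation mask says the token is not grounded there.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                        # (P, T)
    logits = np.where(token_masks.T > 0, logits, -1e9)   # mask ungrounded pairs
    # Softmax over tokens; a pixel with no grounded token degrades to
    # uniform attention (all its logits are equal after masking).
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                   # (P, d)

def stcfg(eps_uncond, eps_cond, grounded, scale=7.5):
    """Spatially Targeted Classifier-Free Guidance (illustrative sketch).

    eps_uncond, eps_cond: (C, H, W) noise predictions; grounded: (H, W)
    binary map. Text guidance is applied only on grounded pixels;
    ungrounded pixels keep the unconditional prediction, which is one
    way to "bypass text influences" and avoid hallucinated detail.
    """
    guided = eps_uncond + scale * (eps_cond - eps_uncond)
    return np.where(grounded[None] > 0, guided, eps_uncond)
```

In this reading, the segmentation masks act twice: inside attention (steering which tokens each pixel can read) and at the guidance step (deciding which pixels receive any text conditioning at all).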
Related papers
- AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution [16.90182090355781]
Visual autoregressive models offer stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. But their application remains underexplored and faces two critical challenges: locality-biased attention and residual-only supervision. We propose a globally consistent visual autoregressive framework tailored for image super-resolution.
arXiv Detail & Related papers (2026-02-28T10:39:06Z) - RCDN: Real-Centered Detection Network for Robust Face Forgery Identification [7.41356813669013]
Existing detection methods achieve near-perfect performance when training and testing are conducted within the same domain. New forgery techniques continuously emerge, and detectors must remain reliable against unseen manipulations. We propose the Real-Centered Detection Network (RCDN), a frequency-spatial convolutional neural network (CNN) framework with an Xception backbone.
arXiv Detail & Related papers (2026-01-17T17:09:15Z) - Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding [54.05243949024302]
Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization. We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
arXiv Detail & Related papers (2025-12-19T12:56:17Z) - Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution [59.71803719801537]
CODSR is a controllable one-step diffusion network for image super-resolution. We propose an LQ-guided feature modulation module to provide high-fidelity conditioning for the diffusion process. We develop a region-adaptive generative prior activation method to effectively enhance perceptual richness.
arXiv Detail & Related papers (2025-12-16T03:56:02Z) - Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks [37.075582998671905]
Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images. This paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR). ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale.
arXiv Detail & Related papers (2025-11-20T13:21:58Z) - HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation [74.1872891313184]
HRSeg is an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE).
arXiv Detail & Related papers (2025-07-17T08:09:31Z) - Controllable Reference Guided Diffusion with Local Global Fusion for Real World Remote Sensing Image Super Resolution [9.658727475375565]
Super-resolution techniques can enhance the spatial resolution of remote sensing images, enabling more efficient large-scale earth observation applications. Existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gaps and significant land cover changes. We propose CRefDiff, a novel controllable reference-guided diffusion model for real-world remote sensing image SR.
arXiv Detail & Related papers (2025-06-30T12:45:28Z) - DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) and a cross-modal reasoning decoder (PCMRD).
arXiv Detail & Related papers (2025-06-23T02:38:56Z) - Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns [4.977925450373957]
We propose a novel TripleGAN framework for training super-resolution reconstruction models. The framework learns real-world degradation patterns from LR observations and synthesizes datasets with corresponding degradation characteristics. Our method exhibits clear advantages in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts.
arXiv Detail & Related papers (2025-06-20T14:24:48Z) - One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation [53.24542646616045]
We propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for image super-resolution (SR) generation. VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and a High-frequency Perception (HFP) loss. The proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.
arXiv Detail & Related papers (2025-06-03T08:28:13Z) - Embedding Similarity Guided License Plate Super Resolution [3.16770435670322]
This study proposes a novel framework that combines pixel-based loss with embedding similarity learning. Experiments on the CCPD and PKU datasets validate the efficacy of the proposed framework.
arXiv Detail & Related papers (2025-01-02T18:42:07Z) - HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior [62.04939047885834]
We present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed spatial-CLIP Map.
arXiv Detail & Related papers (2024-11-27T15:22:44Z) - Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for text-image misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.