Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models
- URL: http://arxiv.org/abs/2506.20832v1
- Date: Wed, 25 Jun 2025 21:00:44 GMT
- Title: Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models
- Authors: Cansu Korkmaz, Ahmet Murat Tekalp, Zafer Dogan
- Abstract summary: This paper introduces a robust framework for identifying the most trustworthy SR sample from a diffusion-generated set. We propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on semantic similarity. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.
- Score: 0.026861992804651083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics such as PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.
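To ground the metric, here is a minimal sketch of a TWS-style score, assuming equal weights over the three components, a Haar wavelet basis, Canny edge maps, and a known reference image (e.g., the ground-truth HR image during evaluation); the paper's exact weighting and normalization are not specified in this abstract.

```python
import numpy as np
import pywt
import torch
from PIL import Image
from skimage.color import rgb2gray
from skimage.feature import canny
from skimage.metrics import structural_similarity as ssim
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Semantic component: cosine similarity of CLIP image embeddings."""
    inputs = proc(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])

def edge_ssim(img_a: Image.Image, img_b: Image.Image) -> float:
    """Structural component: SSIM computed on Canny edge maps."""
    ea = canny(rgb2gray(np.asarray(img_a))).astype(float)
    eb = canny(rgb2gray(np.asarray(img_b))).astype(float)
    return float(ssim(ea, eb, data_range=1.0))

def wavelet_artifact_score(img_a: Image.Image, img_b: Image.Image,
                           levels: int = 3) -> float:
    """Artifact component: 1 minus the normalized high-frequency energy gap
    across wavelet levels (this normalization is an assumption)."""
    ca = pywt.wavedec2(rgb2gray(np.asarray(img_a)), "haar", level=levels)
    cb = pywt.wavedec2(rgb2gray(np.asarray(img_b)), "haar", level=levels)
    gaps = []
    for (ha, va, da), (hb, vb, db) in zip(ca[1:], cb[1:]):
        ea = np.sqrt((ha**2 + va**2 + da**2).mean())
        eb = np.sqrt((hb**2 + vb**2 + db**2).mean())
        gaps.append(abs(ea - eb) / (max(ea, eb) + 1e-8))
    return 1.0 - float(np.mean(gaps))

def trustworthiness_score(sr: Image.Image, ref: Image.Image) -> float:
    """Equal-weight blend of the three components (weights are a guess)."""
    return (clip_similarity(sr, ref) + edge_ssim(sr, ref)
            + wavelet_artifact_score(sr, ref)) / 3.0
```

In the VLM-guided pipeline described above, such a score would be computed per diffusion sample, so the top-ranked candidates can be validated before ensembling.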
Related papers
- One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation [53.24542646616045]
We propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for image super-resolution (SR) generation. VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and a High-frequency Perception (HFP) loss. The proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.
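As a rough illustration of the high-frequency idea, here is a hedged sketch of an HFP-style loss, assuming a one-level Haar high-pass and an L1 penalty; VPD-SR's actual formulation is not given in this summary.

```python
import torch
import torch.nn.functional as F

def haar_highpass(x: torch.Tensor) -> torch.Tensor:
    """One-level Haar split of a (B, C, H, W) tensor (H, W even); returns the
    three high-frequency subbands stacked along the channel axis."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    lh, hl, hh = (a + b - c - d), (a - b + c - d), (a - b - c + d)
    return torch.cat([lh, hl, hh], dim=1) / 2.0

def hfp_loss(sr: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch in high-frequency content between SR output and target."""
    return F.l1_loss(haar_highpass(sr), haar_highpass(target))
```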
arXiv Detail & Related papers (2025-06-03T08:28:13Z) - Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction [0.0]
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
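For reference, a minimal sketch of the split (inductive) conformal recipe that underlies such calibration, assuming a standard 1 - softmax-probability nonconformity score, which may differ from the paper's choice.

```python
import numpy as np

def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.1) -> float:
    """cal_probs: (n, K) predicted probabilities; cal_labels: (n,) true labels."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)     # finite-sample correction
    return float(np.quantile(scores, q, method="higher"))

def prediction_set(test_probs: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask of labels kept in each set; marginal coverage >= 1 - alpha."""
    return (1.0 - test_probs) <= threshold
```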
arXiv Detail & Related papers (2025-04-24T15:39:46Z) - A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal [68.0573194557999]
Single Image Reflection Removal (SIRR) is a canonical blind source separation problem. We propose a novel Deep Exclusion unfolding Network (DExNet) for SIRR. DExNet is constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm.
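A generic sketch of the unfold-and-parameterize idea, using an ISTA-style soft-thresholding step with learned operators; the actual i-SAFU update in DExNet differs.

```python
import torch
import torch.nn as nn

class UnfoldedStep(nn.Module):
    """One unrolled iteration: a learned gradient-style update followed by
    soft-thresholding with a learnable threshold."""
    def __init__(self, channels: int):
        super().__init__()
        self.grad = nn.Conv2d(channels, channels, 3, padding=1)
        self.theta = nn.Parameter(torch.tensor(0.01))

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        z = z - self.grad(z - x)
        return torch.sign(z) * torch.relu(z.abs() - self.theta)

class UnfoldedNet(nn.Module):
    """Stack of unrolled iterations, each with its own parameters."""
    def __init__(self, channels: int = 3, steps: int = 6):
        super().__init__()
        self.steps = nn.ModuleList(UnfoldedStep(channels) for _ in range(steps))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x
        for step in self.steps:
            z = step(z, x)
        return z
```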
arXiv Detail & Related papers (2025-03-03T07:54:27Z) - Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates a local hard negative loss with selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
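A hedged sketch of a local hard-negative term of this kind, assuming word-swapped captions as the negatives; FSC-CLIP's exact loss and its calibrated regularizer are not specified here.

```python
import torch
import torch.nn.functional as F

def hard_negative_loss(img_emb: torch.Tensor, pos_txt_emb: torch.Tensor,
                       neg_txt_emb: torch.Tensor,
                       margin: float = 0.2) -> torch.Tensor:
    """All inputs are L2-normalized (B, D) embeddings; negatives are e.g.
    word-swapped captions. The true caption must beat the negative by a margin."""
    pos = (img_emb * pos_txt_emb).sum(dim=-1)
    neg = (img_emb * neg_txt_emb).sum(dim=-1)
    return F.relu(margin - pos + neg).mean()
```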
arXiv Detail & Related papers (2024-10-07T17:16:20Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the resulting mismatch using prompting techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
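For context, a sketch of zero-shot prompt-based quality scoring in the CLIP-IQA spirit, with fixed antonym prompts standing in for the paper's learned multi-modal prompts.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def quality_score(img: Image.Image) -> float:
    """Softmax over 'good photo' vs 'bad photo' similarities as a quality proxy."""
    inputs = proc(text=["a good photo.", "a bad photo."], images=img,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, 2)
    return float(logits.softmax(dim=-1)[0, 0])
```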
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback [5.665865832321032]
Super-resolution (SR) is an ill-posed inverse problem with a large set of feasible solutions that are consistent with a given low-resolution image.
We propose employing human feedback: human subjects select a small number of likely samples, and we take the ensemble average of the selected samples.
Our proposed strategy provides more trustworthy solutions when compared to state-of-the-art SR methods.
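The selection-then-ensemble step reduces to something like the following sketch, assuming simple pixel-space averaging of the samples the raters (or, in the main paper above, a VLM) chose.

```python
import numpy as np

def ensemble_selected(samples: np.ndarray, selected: list) -> np.ndarray:
    """samples: (N, H, W, C) stack of diffusion SR candidates;
    selected: indices of the samples judged plausible. Returns their mean."""
    return samples[np.asarray(selected)].mean(axis=0)

# e.g., fuse the three samples the raters picked out of ten:
# fused = ensemble_selected(sr_samples, [1, 4, 7])
```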
arXiv Detail & Related papers (2024-02-12T11:55:02Z) - Perception-Distortion Trade-off in the SR Space Spanned by Flow Models [21.597478894658263]
Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space.
We present a simple but effective image ensembling/fusion approach to obtain a single SR image, eliminating random artifacts and improving fidelity without significantly compromising perceptual quality.
arXiv Detail & Related papers (2022-09-18T13:12:21Z) - Quality Assessment of Image Super-Resolution: Balancing Deterministic and Statistical Fidelity [14.586878663223832]
We look at the problem of SR image quality assessment (SR IQA) in a two-dimensional (2D) space of deterministic fidelity (DF) versus statistical fidelity (SF).
We propose an uncertainty weighting scheme that merges the two fidelity measures into an overall quality prediction named the Super Resolution Image Fidelity (SRIF) index.
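A minimal sketch of one way such an uncertainty weighting could combine the two measures, assuming inverse-variance weights; the SRIF index's actual scheme is not given in this summary.

```python
def srif_like(df: float, sf: float, sigma_df: float, sigma_sf: float) -> float:
    """df, sf: deterministic and statistical fidelity scores;
    sigma_*: their estimated uncertainties (inverse-variance weighting)."""
    w_df, w_sf = 1.0 / sigma_df**2, 1.0 / sigma_sf**2
    return (w_df * df + w_sf * sf) / (w_df + w_sf)
```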
arXiv Detail & Related papers (2022-07-15T02:09:17Z) - Learning Resolution-Adaptive Representations for Cross-Resolution Person Re-Identification [49.57112924976762]
Cross-resolution person re-identification problem aims to match low-resolution (LR) query identity images against high resolution (HR) gallery images.
It is a challenging and practical problem since the query images often suffer from resolution degradation due to the different capturing conditions from real-world cameras.
This paper explores an alternative SR-free paradigm to directly compare HR and LR images via a dynamic metric, which is adaptive to the resolution of a query image.
arXiv Detail & Related papers (2022-07-09T03:49:51Z) - When Relation Networks meet GANs: Relation GANs with Triplet Loss [110.7572918636599]
Training stability is still a lingering concern of generative adversarial networks (GANs).
In this paper, we explore a relation network architecture for the discriminator and design a triplet loss that yields better generalization and stability.
Experiments on benchmark datasets show that the proposed relation discriminator and new loss provide significant improvements on various vision tasks.
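A sketch of a triplet loss on discriminator (relation) features, using PyTorch's built-in triplet margin loss; the paper's relation-network pairing is not reproduced here.

```python
import torch.nn.functional as F

def relation_triplet_loss(real_a, real_b, fake, margin: float = 1.0):
    """real_a/real_b: discriminator features of two real samples (anchor and
    positive); fake: features of a generated sample (negative)."""
    return F.triplet_margin_loss(real_a, real_b, fake, margin=margin)
```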
arXiv Detail & Related papers (2020-02-24T11:35:28Z)