DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
- URL: http://arxiv.org/abs/2412.14580v1
- Date: Thu, 19 Dec 2024 07:00:03 GMT
- Title: DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
- Authors: Yiren Song, Xiaokang Liu, Mike Zheng Shou
- Abstract summary: This paper introduces the DiffSim method to measure visual similarity in generative models. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity. We also introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance.
- Score: 19.989551230170584
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.
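A minimal sketch of the core idea, assuming the per-layer attention features of the two images have already been extracted from the denoising U-Net (e.g., via forward hooks at a fixed noise timestep); the layer selection and aggregation below are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def diffsim_score(feats_a, feats_b):
    """Aggregate cosine similarity between aligned attention features.

    feats_a, feats_b: lists of (tokens, dim) tensors, one per selected
    attention layer of the denoising U-Net, for the two images.
    """
    per_layer = []
    for fa, fb in zip(feats_a, feats_b):
        fa = F.normalize(fa, dim=-1)
        fb = F.normalize(fb, dim=-1)
        # Compare spatially aligned tokens, then average over positions.
        per_layer.append((fa * fb).sum(dim=-1).mean())
    return torch.stack(per_layer).mean()
```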
Related papers
- Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image.
We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner.
Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
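As a hedged illustration of treating relations as continuous embeddings under a diffusion process, the standard DDPM forward step applied to a relation vector looks like this (Diff-VRD's conditioning and denoiser are more involved):

```python
import torch

def q_sample(rel_emb, t, alphas_cumprod):
    # Closed-form DDPM forward process on a continuous relation embedding:
    # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    noise = torch.randn_like(rel_emb)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * rel_emb + (1 - a_bar).sqrt() * noise, noise
```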
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
- BELE: Blur Equivalent Linearized Estimator [0.8192907805418581]
This paper introduces a novel parametric model that separates perceptual effects due to strong edge degradations from those caused by texture distortions.
The first is the Blur Equivalent Linearized Estimator, designed to measure blur on strong and isolated edges.
The second is a Complex Peak Signal-to-Noise Ratio, which evaluates distortions affecting texture regions.
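A rough sketch of the second component: a PSNR restricted to texture regions might look like the following (the paper's Complex PSNR operates on a more elaborate decomposition; the binary texture mask here is a simplification):

```python
import numpy as np

def masked_psnr(ref, dist, texture_mask, peak=255.0):
    # PSNR computed only over pixels flagged as texture, so edge-blur
    # effects (handled separately by the BELE component) do not leak in.
    diff = ref.astype(np.float64)[texture_mask] - dist.astype(np.float64)[texture_mask]
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```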
arXiv Detail & Related papers (2025-03-01T14:19:08Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts that steer diffusion models to generate images depicting specific interactions.
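The inversion-based strategy is reminiscent of textual inversion: a learnable token embedding for a relation pattern is optimized against the frozen diffusion model's denoising loss. A hypothetical sketch, where the embedding dimension, learning rate, and `denoiser` interface are all assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical: one learnable embedding per relation pattern (e.g. "riding").
relation_token = torch.nn.Parameter(torch.randn(768) * 0.02)  # 768: assumed text-encoder dim
opt = torch.optim.AdamW([relation_token], lr=5e-4)

def inversion_step(denoiser, noisy_latent, t, prompt_embs, noise):
    # prompt_embs: frozen prompt embeddings with relation_token spliced in
    # by the caller, so gradients flow back to the token alone.
    pred = denoiser(noisy_latent, t, prompt_embs)
    loss = F.mse_loss(pred, noise)  # standard epsilon-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```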
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Foundation Models Boost Low-Level Perceptual Similarity Metrics [6.226609932118124]
For full-reference image quality assessment (FR-IQA) with deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance between features extracted from a pretrained CNN or, more recently, a Transformer network.
This work explores the potential of the intermediate features of these foundation models, which have so far been largely unexplored in the design of low-level perceptual similarity metrics.
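A minimal LPIPS-style distance over intermediate backbone features, assuming the features are already extracted from a frozen foundation model; which layers and which distance to use are exactly the design choices such work studies:

```python
import torch
import torch.nn.functional as F

def perceptual_distance(feats_ref, feats_dist):
    # Mean cosine distance between corresponding intermediate features
    # of a frozen backbone, computed per layer and averaged.
    total = 0.0
    for fr, fd in zip(feats_ref, feats_dist):
        fr = F.normalize(fr.flatten(1), dim=1)
        fd = F.normalize(fd.flatten(1), dim=1)
        total += (1.0 - (fr * fd).sum(dim=1)).mean()
    return total / len(feats_ref)
```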
arXiv Detail & Related papers (2024-09-11T22:32:12Z)
- On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier [20.17288970927518]
We study the similarity of representations between the hidden layers of individual transformers.
We propose an aligned training approach to enhance the similarity between internal representations.
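Linear CKA is a standard way to quantify layer-wise representation similarity; a sketch is below (the paper's exact similarity measure may differ):

```python
import torch

def linear_cka(X, Y):
    # X, Y: (n_samples, features) activations from two layers.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2  # ||Y^T X||_F^2 on centered features
    return hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())
```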
arXiv Detail & Related papers (2024-06-20T16:41:09Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
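The loop described above, reduced to a hypothetical fine-tuning sketch (the loaders, optimizer settings, and schedule are assumptions, not the paper's configuration):

```python
import torch

def reinforce(model, base_loader, counterfactual_loader, epochs=1):
    # Fine-tune on the original data plus the language-guided
    # counterfactual images that exposed the model's weaknesses.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for loader in (base_loader, counterfactual_loader):
            for images, labels in loader:
                opt.zero_grad()
                loss_fn(model(images), labels).backward()
                opt.step()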
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- COSE: A Consistency-Sensitivity Metric for Saliency on Image Classification [21.3855970055692]
We present a set of metrics that utilize vision priors to assess the performance of saliency methods on image classification tasks.
We show that although saliency methods are thought to be architecture-independent, most methods explain transformer-based models better than convolutional ones.
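One vision-prior check in this spirit: a good saliency map should be stable under semantics-preserving transforms. A simplified consistency score follows (COSE's actual metrics are defined more carefully):

```python
import numpy as np

def consistency(sal_orig, sal_transformed, inverse_transform):
    # Map the transformed image's saliency back to the original frame
    # and measure agreement via Pearson correlation of z-scored maps.
    sal_back = inverse_transform(sal_transformed)
    a = (sal_orig.ravel() - sal_orig.mean()) / (sal_orig.std() + 1e-8)
    b = (sal_back.ravel() - sal_back.mean()) / (sal_back.std() + 1e-8)
    return float((a * b).mean())
```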
arXiv Detail & Related papers (2023-09-20T01:06:44Z)
- CorrEmbed: Evaluating Pre-trained Model Image Similarity Efficacy with a Novel Metric [6.904776368895614]
We evaluate the viability of the image embeddings from pre-trained computer vision models using a novel approach named CorrEmbed.
Our approach computes the correlation between distances in image embeddings and distances in human-generated tag vectors.
Our method also identifies deviations from this pattern, providing insights into how different models capture high-level image features.
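The core computation is compact enough to sketch directly: correlate pairwise embedding distances with pairwise tag-vector distances (the cosine metric and Pearson correlation are assumed choices here):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def corrembed_score(embeddings, tag_vectors):
    # Condensed pairwise distance vectors in both spaces, then correlate.
    d_emb = pdist(embeddings, metric="cosine")
    d_tag = pdist(tag_vectors, metric="cosine")
    r, _ = pearsonr(d_emb, d_tag)
    return r
```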
arXiv Detail & Related papers (2023-08-30T16:23:07Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
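All three factors meet in Fréchet-distance metrics such as FID, sketched below: the feature extractor fixes the representation space, the Gaussian Fréchet distance is the distance, and the sizes of `feats_real`/`feats_fake` are the sample counts:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    # Fréchet distance between Gaussians fitted to the two feature sets.
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2.0 * covmean))
```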
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches.
We propose a novel framework based on DDPM for semantic image synthesis.
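One simple way to condition a DDPM noise predictor on a semantic layout is channel concatenation; a toy sketch under that assumption (the paper's architecture injects the layout differently in detail):

```python
import torch

class ConditionalDenoiser(torch.nn.Module):
    # Toy conditioning scheme: stack the semantic map onto the noisy image
    # as extra input channels. A simple baseline, not the paper's design.
    def __init__(self, unet):
        super().__init__()
        self.unet = unet  # assumed signature: unet(x, t) -> predicted noise

    def forward(self, noisy_image, t, semantic_map):
        x = torch.cat([noisy_image, semantic_map], dim=1)
        return self.unet(x, t)
```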
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by human cognition of semantic similarity, we propose a generalized similarity learning paradigm that represents the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
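A toy rendering of "attributable" similarity: per-node similarities (e.g., one node per part or concept) are combined with importance weights so the final score decomposes into inspectable contributions. Names and weighting are hypothetical:

```python
import torch
import torch.nn.functional as F

def attributable_similarity(nodes_a, nodes_b, weights):
    # nodes_a, nodes_b: (num_nodes, dim) part-level features of two images;
    # weights: learned importance of each node in the similarity graph.
    sims = F.cosine_similarity(nodes_a, nodes_b, dim=-1)
    contributions = weights * sims           # per-node attribution
    return contributions.sum(), contributions  # score plus its explanation
```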
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- Duality-Induced Regularizer for Semantic Matching Knowledge Graph Embeddings [70.390286614242]
We propose a novel regularizer -- namely, the DUality-induced RegulArizer (DURA) -- which effectively encourages entities with similar semantics to have similar embeddings.
Experiments demonstrate that DURA consistently and significantly improves the performance of state-of-the-art semantic matching models.
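For a bilinear scorer such as DistMult, score(h, r, t) = <h ∘ r, t>, a DURA-style penalty bounds the entity embeddings and their relation-transformed images; a simplified sketch (the paper derives the exact form from a duality argument):

```python
import torch

def dura_penalty(h, r, t):
    # h, r, t: (batch, dim) head, relation, and tail embeddings.
    return ((h * r) ** 2).sum(dim=1).mean() + (t ** 2).sum(dim=1).mean() \
         + ((t * r) ** 2).sum(dim=1).mean() + (h ** 2).sum(dim=1).mean()
```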
arXiv Detail & Related papers (2022-03-24T09:24:39Z)
- IMACS: Image Model Attribution Comparison Summaries [16.80986701058596]
We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques.
IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features.
We show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
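A hypothetical reduction of the pipeline's first two stages, using plain gradient×input attributions and k-means (IMACS's attribution and clustering choices may differ):

```python
import torch
from sklearn.cluster import KMeans

def grad_x_input(model, images, labels):
    # Gradient x input attribution for the labeled class scores.
    images = images.clone().requires_grad_(True)
    scores = model(images).gather(1, labels[:, None]).sum()
    scores.backward()
    return (images.grad * images).detach()

def cluster_attributions(attr_maps, k=8):
    # Group similar attribution maps before visual comparison across models.
    flat = attr_maps.reshape(len(attr_maps), -1).cpu().numpy()
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(flat)
```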
arXiv Detail & Related papers (2022-01-26T21:35:14Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
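The image-level component highlighted here is typically an InfoNCE objective over matched pairs in a batch; a standard sketch (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    # z_a[i] and z_b[i] embed a positive pair; every other item in the
    # batch acts as a negative.
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```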
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.