Related papers: Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

URL: http://arxiv.org/abs/2509.21989v1
Date: Fri, 26 Sep 2025 07:11:55 GMT
Title: Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation
Authors: Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, Peter Wonka
Abstract summary: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
Score: 120.23172120151821
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project Page: https://abdo-eldesokey.github.io/mind-the-glitch/
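
The abstract contrasts global feature-based metrics with a score that can also localize inconsistent regions. The toy sketch below illustrates only that general distinction; it is not the paper's VSM metric or architecture. It assumes each image is already represented as a list of per-location feature vectors and that location i in one image corresponds to location i in the other. The function names (`global_score`, `local_scores`) and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(feats):
    """Average a list of per-location feature vectors into one global vector."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


def global_score(feats_a, feats_b):
    """Global-style comparison: one similarity value for the whole image pair."""
    return cosine(mean_pool(feats_a), mean_pool(feats_b))


def local_scores(feats_a, feats_b):
    """Per-location comparison: one similarity value per corresponding location.

    Assumption: index i in feats_a corresponds to index i in feats_b.
    """
    return [cosine(a, b) for a, b in zip(feats_a, feats_b)]


# Hypothetical toy features: 4 locations, 3 dimensions each.
reference = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
# Same as the reference except at the last location.
generated = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

overall = global_score(reference, generated)
per_location = local_scores(reference, generated)

# Index of the least similar location, i.e. where the toy inconsistency sits.
worst_location = min(range(len(per_location)), key=per_location.__getitem__)

print("global score:", round(overall, 3))
print("per-location scores:", [round(s, 3) for s in per_location])
print("least consistent location:", worst_location)
```

In this toy setup the single global value gives no indication of where the mismatch is, while the per-location list drops at the changed location. That is the kind of spatial information a global embedding comparison cannot provide.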

Related papers

Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner. Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment [25.209622555403527]
We propose a framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. AVSE calculates visual semantic similarity by finding the optimal match of meta-semantic embeddings of two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets.
arXiv Detail & Related papers (2025-03-10T06:38:41Z)
Improving vision-language alignment with graph spiking hybrid Networks [10.88584928028832]
This paper proposes a comprehensive visual semantic representation module, necessitating the utilization of panoptic segmentation to generate fine-grained semantic features. We propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information.
arXiv Detail & Related papers (2025-01-31T11:55:17Z)
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer. Our method replaces original language-independent encoding with cross-modal encoding in visual analysis. Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task. We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories. In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture. The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs. We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet. Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
