Diffusion Classifiers Understand Compositionality, but Conditions Apply
- URL: http://arxiv.org/abs/2505.17955v2
- Date: Thu, 29 May 2025 17:59:50 GMT
- Title: Diffusion Classifiers Understand Compositionality, but Conditions Apply
- Authors: Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach
- Abstract summary: We present a study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, we study three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. We also introduce a new diagnostic benchmark, Self-Bench, comprised of images created by diffusion models themselves.
- Score: 35.37894720627495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
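To make the classification rule concrete: a zero-shot diffusion classifier scores each candidate class prompt c by the expected denoising error E_{t,eps}[w(t) * ||eps - eps_theta(x_t, t, c)||^2] and predicts the class with the lowest score. The sketch below is a minimal illustration of this criterion for the UNet-based SD 1.5/2.0 (SD3-m is a rectified-flow transformer and would need the corresponding loss); the checkpoint id, prompt template, and `weight_fn` hook are illustrative assumptions, not the authors' released code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Any epsilon-prediction SD checkpoint works in principle; SD 1.5 is one of
# the models studied in the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

@torch.no_grad()
def denoising_error(latents, prompt, n_samples=64, weight_fn=None):
    """Monte-Carlo estimate of E_{t,eps}[w(t) * ||eps - eps_theta(x_t, t, c)||^2].

    Lower error means the class prompt explains the image better (the
    diffusion-classifier criterion). `weight_fn` is an optional per-timestep
    weighting (uniform when None) -- the knob whose effect on performance the
    paper analyses. `latents` are VAE latents of the image, e.g.
    pipe.vae.encode(img).latent_dist.mean * pipe.vae.config.scaling_factor.
    """
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    text_emb = pipe.text_encoder(ids.to(latents.device))[0]

    total, T = 0.0, pipe.scheduler.config.num_train_timesteps
    for _ in range(n_samples):
        t = torch.randint(0, T, (1,), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
        w = weight_fn(int(t)) if weight_fn is not None else 1.0
        total += w * torch.nn.functional.mse_loss(pred, noise).item()
    return total / n_samples

def classify(latents, class_names):
    # Argmin over candidate prompts; the prompt template is a placeholder.
    errs = {c: denoising_error(latents, f"a photo of {c}") for c in class_names}
    return min(errs, key=errs.get)
```

A common variance-reduction trick is to reuse the same (t, noise) pairs across all candidate prompts, so the per-class scores differ only through the text conditioning.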
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - Diffusion Models in Low-Level Vision: A Survey [82.77962165415153]
Diffusion model-based solutions have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models. We summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios.
arXiv Detail & Related papers (2024-06-17T01:49:27Z) - Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil characteristics of the proposed framework, dubbed Vermouth, such as the varying granularity of perception concealed in latent variables at distinct timesteps and U-Net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition [43.01467525231004]
We introduce DiffAugment -- a method which augments the tail classes in the linguistic space by making use of WordNet.
We demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes.
We also propose a novel subject and object based seeding strategy for diffusion sampling which improves the discriminative capability of the generated visual embeddings.
arXiv Detail & Related papers (2024-01-01T21:20:43Z) - Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z) - Stable Diffusion for Data Augmentation in COCO and Weed Datasets [5.81198182644659]
Generative models have increasingly impacted various tasks, from computer vision to interior design and beyond. Stable Diffusion, a powerful diffusion model, enables the creation of high-resolution images with intricate details from text prompts or reference images. This study explores the effectiveness of Stable Diffusion by evaluating seven common categories and three widespread weed species. Promising results were achieved for certain classes, demonstrating the potential of Stable Diffusion in enhancing image-sparse datasets.
arXiv Detail & Related papers (2023-12-07T02:23:32Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - Your Diffusion Model is Secretly a Zero-Shot Classifier [90.40799216880342]
We show that density estimates from large-scale text-to-image diffusion models can be leveraged to perform zero-shot classification.
Our generative approach to classification attains strong results on a variety of benchmarks.
Our results are a step toward using generative over discriminative models for downstream tasks.
arXiv Detail & Related papers (2023-03-28T17:59:56Z) - DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery [20.787180028571694]
DiffusionSeg is a synthesis-exploitation framework with a two-stage strategy.
In the first, synthesis stage, we synthesize abundant images and propose a novel training-free AttentionCut to obtain masks.
In the second, exploitation stage, we bridge the structural gap by using an inversion technique to map the given image back to diffusion features; a rough sketch of such an inversion follows this list.
arXiv Detail & Related papers (2023-03-17T07:47:55Z)
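For intuition about the inversion step mentioned in the DiffusionSeg summary, here is a rough sketch of deterministic DDIM inversion, which walks a clean latent forward to x_T by reusing the DDIM update with increasing noise levels. The `eps_model` callable and the uniform timestep spacing are simplifying assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, eps_model, alphas_cumprod, num_steps=50):
    """Deterministic DDIM inversion: map a clean latent x_0 to x_T.

    eps_model(x_t, t) -> predicted noise; alphas_cumprod is the training
    schedule's cumulative alpha-bar (a 1-D tensor of length T). Running
    the usual DDIM sampler from the returned latent approximately
    reconstructs the input, so intermediate states expose the model's
    internal features for the given image.
    """
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(0, T - 1, num_steps, dtype=torch.long)
    x = latents
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        # Predicted clean latent at the current noise level.
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Step *forward* in noise level, reusing the same epsilon.
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x
```

With a diffusers pipeline, `eps_model` would typically wrap `pipe.unet` with a fixed (e.g. unconditional) text embedding, and `alphas_cumprod` can be taken from `pipe.scheduler.alphas_cumprod`.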
This list is automatically generated from the titles and abstracts of the papers on this site.