Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot
Classification via Stable Diffusion
- URL: http://arxiv.org/abs/2302.03298v4
- Date: Mon, 17 Apr 2023 01:00:24 GMT
- Title: Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot
Classification via Stable Diffusion
- Authors: Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, Clinton
Fookes
- Abstract summary: Model-Agnostic Zero-Shot Classification (MA-ZSC) refers to training non-specific classification architectures to classify real images without using any real images during training.
Recent research has demonstrated that generating synthetic training images using diffusion models provides a potential solution to address MA-ZSC.
We propose modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity.
- Score: 22.237426507711362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the problem of Model-Agnostic Zero-Shot
Classification (MA-ZSC), which refers to training non-specific classification
architectures (downstream models) to classify real images without using any
real images during training. Recent research has demonstrated that generating
synthetic training images using diffusion models provides a potential solution
to address MA-ZSC. However, the performance of this approach currently falls
short of that achieved by large-scale vision-language models. One possible
explanation is a significant domain gap between synthetic and real images. Our
work offers a fresh perspective on the problem by providing initial insights
that MA-ZSC performance can be improved by increasing the diversity of images
in the generated dataset. We propose a set of modifications to the
text-to-image generation process using a pre-trained diffusion model to enhance
diversity, which we refer to as our $\textbf{bag of tricks}$. Our approach
shows notable improvements in various classification architectures, with
results comparable to state-of-the-art models such as CLIP. To validate our
approach, we conduct experiments on CIFAR10, CIFAR100, and EuroSAT, the last of
which is particularly challenging for zero-shot classification due to its
satellite imagery domain. We evaluate our approach with five classification architectures,
including ResNet and ViT. Our findings provide initial insights into the
problem of MA-ZSC using diffusion models. All code will be available on GitHub.
Related papers
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves superior performance compared with other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) for text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the mismatch between CLIP's pretraining and the IQA task using prompting techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
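As a rough illustration of the inter-class translation idea, the sketch below uses the generic diffusers img2img pipeline to push a source image toward a target class; the prompt template and strength value are assumptions for illustration, not Diff-Mix's actual procedure.

```python
# Hedged sketch of inter-class image translation in the spirit of Diff-Mix,
# using the generic diffusers img2img pipeline; the prompt template and
# strength value below are illustrative assumptions, not the paper's method.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def translate_between_classes(source_image: Image.Image, target_class: str):
    # Partially noise the source image and denoise it toward the target
    # class, yielding a sample that blends source structure with target
    # semantics -- the inter-class augmentation idea.
    return pipe(
        prompt=f"a photo of a {target_class}",
        image=source_image,
        strength=0.6,       # how far to push toward the target class (assumed)
        guidance_scale=7.5,
    ).images[0]
```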
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation [10.389698647141296]
Few-shot image generation aims to produce plausible and diverse images for one category given a few images from this category.
Existing approaches either globally interpolate different images or fuse local representations with pre-defined coefficients.
This paper proposes a novel mechanism to inject external semantic signals into internal local representations.
arXiv Detail & Related papers (2023-08-30T16:10:21Z)
- Diffusion Models Beat GANs on Image Classification [37.70821298392606]
Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc.
We present our findings that the embeddings learned by these models are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification.
We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods for classification tasks.
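As a rough illustration of this feature-extraction idea, the sketch below pools a mid-block activation from a Stable Diffusion U-Net via a forward hook; the block choice, timestep, and pooling are illustrative assumptions, not the paper's tuned configuration.

```python
# Hedged sketch: pooling an intermediate U-Net activation from a diffusion
# model as a classification feature. Block, timestep, and pooling choices
# are illustrative assumptions, not the paper's tuned configuration.
import torch
from diffusers import UNet2DConditionModel, DDPMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).eval()
scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

features = {}
def hook(_module, _inputs, output):
    # Global-average-pool the mid-block activation into a feature vector.
    features["mid"] = output.mean(dim=(2, 3))

unet.mid_block.register_forward_hook(hook)

@torch.no_grad()
def extract_features(latents, text_embeds, t=100):
    # `latents` are VAE-encoded images; noise them to timestep t, run the
    # U-Net once, and read the pooled mid-block activation from the hook.
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, torch.tensor([t]))
    unet(noisy, t, encoder_hidden_states=text_embeds)
    return features["mid"]  # feed into a linear classifier head
```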
arXiv Detail & Related papers (2023-07-17T17:59:40Z)
- DifFSS: Diffusion Model for Few-Shot Semantic Segmentation [24.497112957831195]
This paper presents DifFSS, the first work to leverage diffusion models for the few-shot semantic segmentation (FSS) task.
DifFSS, a novel FSS paradigm, can further improve the performance of state-of-the-art FSS models by a large margin without modifying their network structure.
arXiv Detail & Related papers (2023-07-03T06:33:49Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- High-resolution semantically-consistent image-to-image translation [0.0]
This paper proposes an unsupervised domain adaptation model that preserves semantic consistency and per-pixel quality during the style-transfer phase.
The proposed model shows substantial performance gain compared to the SemI2I model and reaches similar results as the state-of-the-art CyCADA model.
arXiv Detail & Related papers (2022-09-13T19:08:30Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto GAN-based approach.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)