Related papers: Diverse and Tailored Image Generation for Zero-shot Multi-label Classification

Diverse and Tailored Image Generation for Zero-shot Multi-label Classification

URL: http://arxiv.org/abs/2404.03144v1
Date: Thu, 4 Apr 2024 01:34:36 GMT
Title: Diverse and Tailored Image Generation for Zero-shot Multi-label Classification
Authors: Kaixin Zhang, Zhixiang Yuan, Tao Huang,
Abstract summary: zero-shot multi-label classification has garnered considerable attention for its capacity to operate predictions on unseen labels without human annotations. prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance. We propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels.
Score: 3.354528906571718
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, zero-shot multi-label classification has garnered considerable attention for its capacity to operate predictions on unseen labels without human annotations. Nevertheless, prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance. Drawing inspiration from the success of text-to-image generation models in producing realistic images, we propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels. Our approach introduces a novel image generation framework that produces multi-label synthetic images of unseen classes for classifier training. To enhance diversity in the generated images, we leverage a pre-trained large language model to generate diverse prompts. Employing a pre-trained multi-modal CLIP model as a discriminator, we assess whether the generated images accurately represent the target classes. This enables automatic filtering of inaccurately generated images, preserving classifier accuracy. To refine text prompts for more precise and effective multi-label object generation, we introduce a CLIP score-based discriminative loss to fine-tune the text encoder in the diffusion model. Additionally, to enhance visual features on the target task while maintaining the generalization of original features and mitigating catastrophic forgetting resulting from fine-tuning the entire visual encoder, we propose a feature fusion module inspired by transformer attention mechanisms. This module aids in capturing global dependencies between multiple objects more effectively. Extensive experimental results validate the effectiveness of our approach, demonstrating significant improvements over state-of-the-art methods.

Related papers

GMAIL: Generative Modality Alignment for generated Image Learning [51.071351994330605]
We propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images.<n>Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments.
arXiv Detail & Related papers (2026-02-17T05:40:25Z)
GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features.<n>A residual learning module is proposed to recover the fine-grained details to minimize the reconstruction error caused by quantization.<n>Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z)
A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model [12.15621649989295]
A generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations.<n>Our approach integrates the concentrated perturbation strategy from Saliency-based Auto-Encoder with the dissimilar text embeddings similar to Generative Adversarial Multi-Object Scene Attacks (GAMA)
arXiv Detail & Related papers (2025-11-03T08:02:48Z)
MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection [32.662682253295486]
We propose Multimodal Discriminative Learning for Generalizable AI-generated Image Detection (MiraGegenerator)<n>We apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning.<n>MiraGegenerator achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.
arXiv Detail & Related papers (2025-08-03T00:19:18Z)
Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model [1.8160945635344523]
This study introduces a robust multi-label classification system designed to assign multiple labels to a single image. We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models. Our proposed classification model combines Convolutional Neural Networks (CNN) for image processing with NLP techniques for analyzing textual description.
arXiv Detail & Related papers (2025-01-03T03:11:17Z)
Efficient Visualization of Neural Networks with Generative Models and Adversarial Perturbations [0.0]
This paper presents a novel approach for deep visualization via a generative network, offering an improvement over existing methods. Our model simplifies the architecture by reducing the number of networks used, requiring only a generator and a discriminator. Our model requires less prior training knowledge and uses a non-adversarial training process, where the discriminator acts as a guide.
arXiv Detail & Related papers (2024-09-20T14:59:25Z)
Towards Generative Class Prompt Learning for Fine-grained Visual Recognition [5.633314115420456]
Generative Class Prompt Learning and Contrastive Multi-class Prompt Learning are presented. Generative Class Prompt Learning improves visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation.
arXiv Detail & Related papers (2024-09-03T12:34:21Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
Active Generation for Image Classification [45.93535669217115]
We propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model. With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation.
arXiv Detail & Related papers (2024-03-11T08:45:31Z)
Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data. We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Leaning (VIL) A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions. We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model. We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
Object-Aware Self-supervised Multi-Label Learning [9.496981642855769]
We propose an Object-Aware Self-Supervision (OASS) method to obtain more fine-grained representations for multi-label learning. The proposed method can be leveraged to efficiently generate Class-Specific Instances (CSI) in a proposal-free fashion. Experiments on the VOC2012 dataset for multi-label classification demonstrate the effectiveness of the proposed method against the state-of-the-art counterparts.
arXiv Detail & Related papers (2022-05-14T10:14:08Z)
Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning can hardly improve in multi-label cases. We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
Saliency-driven Class Impressions for Feature Visualization of Deep Neural Networks [55.11806035788036]
It is advantageous to visualize the features considered to be essential for classification. Existing visualization methods develop high confidence images consisting of both background and foreground features. In this work, we propose a saliency-driven approach to visualize discriminative features that are considered most important for a given task.
arXiv Detail & Related papers (2020-07-31T06:11:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.