Related papers: RevCD -- Reversed Conditional Diffusion for Generalized Zero-Shot Learning

RevCD -- Reversed Conditional Diffusion for Generalized Zero-Shot Learning

URL: http://arxiv.org/abs/2409.00511v1
Date: Sat, 31 Aug 2024 17:37:26 GMT
Title: RevCD -- Reversed Conditional Diffusion for Generalized Zero-Shot Learning
Authors: William Heyden, Habib Ullah, M. Salman Siddiqui, Fadi Al Machot,
Abstract summary: In computer vision, knowledge from seen categories is transferred to unseen categories by exploiting the relationships between visual features and available semantic information. We present a reversed conditional Diffusion-based model (RevCD) that mitigates this issue by generating semantic features from visual inputs. Our RevCD model consists of a cross Hadamard-Addition embedding of a sinusoidal time schedule and a multi-headed visual transformer for attention-guided embeddings.
Score: 0.6792605600335813
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In Generalized Zero-Shot Learning (GZSL), we aim to recognize both seen and unseen categories using a model trained only on seen categories. In computer vision, this translates into a classification problem, where knowledge from seen categories is transferred to unseen categories by exploiting the relationships between visual features and available semantic information, such as text corpora or manual annotations. However, learning this joint distribution is costly and requires one-to-one training with corresponding semantic information. We present a reversed conditional Diffusion-based model (RevCD) that mitigates this issue by generating semantic features synthesized from visual inputs by leveraging Diffusion models' conditional mechanisms. Our RevCD model consists of a cross Hadamard-Addition embedding of a sinusoidal time schedule and a multi-headed visual transformer for attention-guided embeddings. The proposed approach introduces three key innovations. First, we reverse the process of generating semantic space based on visual data, introducing a novel loss function that facilitates more efficient knowledge transfer. Second, we apply Diffusion models to zero-shot learning - a novel approach that exploits their strengths in capturing data complexity. Third, we demonstrate our model's performance through a comprehensive cross-dataset evaluation. The complete code will be available on GitHub.

Related papers

No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning [0.0]
Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem.<n>This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle.
arXiv Detail & Related papers (2025-09-23T12:54:52Z)
Generalized Semantic Contrastive Learning via Embedding Side Information for Few-Shot Object Detection [52.490375806093745]
The objective of few-shot object detection (FSOD) is to detect novel objects with few training samples. We introduce the side information to alleviate the negative influences derived from the feature space and sample viewpoints. Our model outperforms the previous state-of-the-art methods, significantly improving the ability of FSOD in most shots/splits.
arXiv Detail & Related papers (2025-04-09T17:24:05Z)
SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning [0.7420433640907689]
Generalized Zero-Shot Learning (GZSL) recognizes unseen classes by transferring knowledge from the seen classes. This paper introduces a dual strategy to address the generalization gap.
arXiv Detail & Related papers (2023-12-20T15:18:51Z)
DiffAug: Enhance Unsupervised Contrastive Learning with Domain-Knowledge-Free Diffusion-based Data Augmentation [48.25619775814776]
This paper proposes DiffAug, a novel unsupervised contrastive learning technique with diffusion mode-based positive data generation. DiffAug consists of a semantic encoder and a conditional diffusion model; the conditional diffusion model generates new positive samples conditioned on the semantic encoding. Experimental evaluations show that DiffAug outperforms hand-designed and SOTA model-based augmentation methods on DNA sequence, visual, and bio-feature datasets.
arXiv Detail & Related papers (2023-09-10T13:28:46Z)
Federated Zero-Shot Learning for Visual Recognition [55.65879596326147]
We propose a novel Federated Zero-Shot Learning FedZSL framework. FedZSL learns a central model from the decentralized data residing on edge devices. The effectiveness and robustness of FedZSL are demonstrated by extensive experiments conducted on three zero-shot benchmark datasets.
arXiv Detail & Related papers (2022-09-05T14:49:34Z)
Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR) Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. Experiment results show our model considerably improves upon the state of the arts in ZSAR, reaching encouraging top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval [51.53892300802014]
We show that supervised neural information retrieval models are prone to learning sparse attention patterns over passage tokens. Using a novel targeted synthetic data generation method, we teach neural IR to attend more uniformly and robustly to all entities in a given passage.
arXiv Detail & Related papers (2022-04-24T22:36:48Z)
Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize the seen and unseen samples, where unseen classes are not observable during training. We propose a novel Generation Shifts Mitigating Flow framework for learning unseen data synthesis efficiently and effectively. Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z)
Disentangling Semantic-to-visual Confusion for Zero-shot Learning [13.610995960100869]
We develop a novel model called Disentangling Class Representation Generative Adrial Network (DCR-GAN) Benefiting from the disentangled representations, DCR-GAN could fit a more realistic distribution over both seen and unseen features. Our proposed model can lead to superior performance to the state-of-the-arts on four benchmark datasets.
arXiv Detail & Related papers (2021-06-16T08:04:11Z)
Transductive Zero-Shot Learning by Decoupled Feature Generation [30.664199050468472]
We focus on the transductive setting, in which unlabelled visual data from unseen classes is available. We propose to decouple tasks of generating realistic visual features and translating semantic attributes into visual cues. We present a detailed ablation study to dissect the effect of our proposed decoupling approach, while demonstrating its superiority over the related state-of-the-art.
arXiv Detail & Related papers (2021-02-05T16:17:52Z)
Explanation-Guided Training for Cross-Domain Few-Shot Classification [96.12873073444091]
Cross-domain few-shot classification task (CD-FSC) combines few-shot classification with the requirement to generalize across domains represented by datasets. We introduce a novel training approach for existing FSC models. We show that explanation-guided training effectively improves the model generalization.
arXiv Detail & Related papers (2020-07-17T07:28:08Z)
Two-Level Adversarial Visual-Semantic Coupling for Generalized Zero-shot Learning [21.89909688056478]
We propose a new two-level joint idea to augment the generative network with an inference network during training. This provides strong cross-modal interaction for effective transfer of knowledge between visual and semantic domains. We evaluate our approach on four benchmark datasets against several state-of-the-art methods, and show its performance.
arXiv Detail & Related papers (2020-07-15T15:34:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.