TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning
- URL: http://arxiv.org/abs/2112.08643v1
- Date: Thu, 16 Dec 2021 05:49:51 GMT
- Title: TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning
- Authors: Shiming Chen, Ziming Hong, Guo-Sen Xie, Jian Zhao, Xinge You,
Shuicheng Yan, and Ling Shao
- Abstract summary: Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones.
Existing attention-based models learn only inferior region features from a single image because they rely solely on unidirectional attention.
We propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations.
- Score: 119.43299939907685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot learning (ZSL) tackles the novel class recognition problem by
transferring semantic knowledge from seen classes to unseen ones. Existing
attention-based models learn only inferior region features from a single image
because they rely solely on unidirectional attention, which ignores the
transferability and discriminative attribute localization of visual features.
In this paper, we propose a cross attribute-guided Transformer network, termed
TransZero++, to refine visual features and learn accurate attribute
localization for semantic-augmented visual embedding representations in ZSL.
TransZero++ consists of an attribute$\rightarrow$visual Transformer sub-net
(AVT) and a visual$\rightarrow$attribute Transformer sub-net (VAT).
Specifically, AVT first employs a feature augmentation encoder to alleviate the
cross-dataset problem and improves the transferability of visual features by
reducing the entangled relative geometry relationships among region features.
Then, an attribute$\rightarrow$visual decoder localizes the image regions most
relevant to each attribute for attribute-based visual feature representations.
Analogously, VAT uses a similar feature augmentation encoder to refine the
visual features, which are then fed into a visual$\rightarrow$attribute decoder
to learn visual-based attribute features. By further introducing semantic
collaborative losses, the two attribute-guided Transformers teach each other to
learn semantic-augmented visual embeddings via semantic collaborative learning.
Extensive experiments show that TransZero++ achieves new state-of-the-art
results on three challenging ZSL benchmarks. The code is available at:
\url{https://github.com/shiming-chen/TransZero_pp}.
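To make the architecture concrete, below is a minimal PyTorch sketch of the two
cross attribute-guided decoders and the collaborative loss described above. All
names (CrossAttention, TransZeroSketch, collaborative_loss), the single-head
attention, the default sizes, and the L2 form of the collaborative term are
illustrative assumptions, not the authors' implementation; the feature
augmentation encoders are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention: `query` tokens attend over `context` tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query, context):
        q, k, v = self.q(query), self.k(context), self.v(context)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                   # (B, len(query), dim)

class TransZeroSketch(nn.Module):
    """AVT direction: attribute embeddings query region features.
       VAT direction: region features query attribute embeddings."""
    def __init__(self, num_attrs=312, dim=256):
        super().__init__()
        self.attr_embed = nn.Parameter(torch.randn(num_attrs, dim))
        self.avt = CrossAttention(dim)              # attribute -> visual decoder
        self.vat = CrossAttention(dim)              # visual -> attribute decoder
        self.score_avt = nn.Linear(dim, 1)          # one score per attribute token
        self.score_vat = nn.Linear(dim, num_attrs)  # attribute scores per region

    def forward(self, regions):                     # regions: (B, R, dim)
        attrs = self.attr_embed.expand(regions.size(0), -1, -1)       # (B, A, dim)
        # AVT: each attribute localizes its most relevant image regions.
        s_avt = self.score_avt(self.avt(attrs, regions)).squeeze(-1)  # (B, A)
        # VAT: each region is re-expressed in terms of the attributes.
        s_vat = self.score_vat(self.vat(regions, attrs)).mean(dim=1)  # (B, A)
        return s_avt, s_vat

def collaborative_loss(s_avt, s_vat):
    # Semantic collaborative learning: the two branches teach each other by
    # pulling their attribute predictions together (an L2 form is assumed here).
    return F.mse_loss(s_avt, s_vat)
```

At training time, each branch's attribute scores would additionally be matched
against the ground-truth class attribute vector of the seen class, with the
collaborative term coupling the two branches.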
Related papers
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT first introduces semantic-embedded token learning to improve visual-semantic correspondences via semantic enhancement.
Then, it fuses visual tokens with low semantic-visual correspondence to discard semantically unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
- Dual Feature Augmentation Network for Generalized Zero-shot Learning [14.410978100610489]
Zero-shot learning (ZSL) aims to infer novel classes without training samples by transferring knowledge from seen classes; a generic sketch of this attribute-based transfer appears after this list.
Existing embedding-based approaches for ZSL typically employ attention mechanisms to locate attributes on an image.
We propose a novel Dual Feature Augmentation Network (DFAN), which comprises two feature augmentation modules.
arXiv Detail & Related papers (2023-09-25T02:37:52Z)
- Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling unmatched semantic-visual pairs to be recast as matched ones.
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
- Exploiting Semantic Attributes for Transductive Zero-Shot Learning [97.61371730534258]
Zero-shot learning aims to recognize unseen classes by generalizing the relation between visual features and semantic attributes learned from the seen classes.
We present a novel transductive ZSL method that produces semantic attributes of the unseen data and imposes them on the generative process.
Experiments on five standard benchmarks show that our method yields state-of-the-art results for zero-shot learning.
arXiv Detail & Related papers (2023-03-17T09:09:48Z)
- Vision Transformer-based Feature Extraction for Generalized Zero-Shot Learning [24.589101099475947]
Generalized zero-shot learning (GZSL) trains a deep learning model to identify unseen classes using image attributes.
In this paper, we put forth a new GZSL approach exploiting Vision Transformer (ViT) to maximize the attribute-related information contained in the image feature.
arXiv Detail & Related papers (2023-02-02T04:52:08Z)
- Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called the hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, constructed from both bottom-up and top-down dynamic routing pathways, to generate attribute-aligned visual features.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
- MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning [28.330268557106912]
A key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes.
We propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features.
arXiv Detail & Related papers (2022-03-07T05:27:08Z)
- TransZero: Attribute-guided Transformer for Zero-Shot Learning [25.55614833575993]
Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen ones.
We propose an attribute-guided Transformer network, TransZero, to refine visual features and learn attribute localization for discriminative visual embedding representations.
arXiv Detail & Related papers (2021-12-03T02:39:59Z)
- Attribute Prototype Network for Zero-Shot Learning [113.50220968583353]
We propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features.
Our model points to the visual evidence of the attributes in an image, confirming the improved attribute localization ability of our image representation.
arXiv Detail & Related papers (2020-08-19T06:46:35Z)
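Several of the entries above, like the main paper, rest on the same final
inference step: per-attribute scores predicted for an image are compared
against the attribute annotations of the candidate (unseen) classes. The
sketch below shows a generic, cosine-similarity version of that step; it is
illustrative rather than any single paper's method, and the function name and
shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def zsl_classify(attr_scores, class_attrs):
    """attr_scores: (B, A) per-attribute scores predicted for each image.
       class_attrs: (C, A) attribute annotations of the candidate classes
                    (the unseen classes at test time).
       Returns the best-matching class index for each image."""
    # Cosine compatibility between predicted attributes and class prototypes.
    sims = F.normalize(attr_scores, dim=-1) @ F.normalize(class_attrs, dim=-1).T
    return sims.argmax(dim=-1)                            # (B,)

# Example: 4 images, 312 attributes (CUB-style), 50 unseen classes.
print(zsl_classify(torch.randn(4, 312), torch.randn(50, 312)))
```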
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.