Hybrid Routing Transformer for Zero-Shot Learning
- URL: http://arxiv.org/abs/2203.15310v1
- Date: Tue, 29 Mar 2022 07:55:08 GMT
- Title: Hybrid Routing Transformer for Zero-Shot Learning
- Authors: De Cheng, Gerong Wang, Bo Wang, Qiang Zhang, Jungong Han, Dingwen Zhang
- Abstract summary: This paper presents a novel transformer encoder-decoder model, called the hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention module, constructed from both bottom-up and top-down dynamic routing pathways, to generate attribute-aligned visual features.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors, producing the final class-label predictions.
- Score: 83.64532548391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot learning (ZSL) aims to learn models that can recognize unseen image
semantics based on training data with seen semantics. Recent studies either
leverage global image features or mine discriminative local patch features to
associate the extracted visual features with the semantic attributes. However,
lacking the top-down guidance and semantic alignment needed to ensure that the
model attends to the truly attribute-correlated regions, these methods still
suffer from a significant semantic gap between the visual modality and the
attribute modality, which makes their predictions on unseen semantics
unreliable. To solve this problem, this paper presents a novel transformer
encoder-decoder model, called the hybrid routing transformer (HRT). In the HRT
encoder, we embed an active attention module, constructed from both bottom-up
and top-down dynamic routing pathways, to generate attribute-aligned visual
features. In the HRT decoder, we use static routing to calculate the
correlation among the attribute-aligned visual features, the corresponding
attribute semantics, and the class attribute vectors, producing the final
class-label predictions. This design makes the presented transformer a hybrid
of 1) top-down and bottom-up attention pathways and 2) dynamic and static
routing pathways. Comprehensive experiments on three widely used benchmark
datasets, namely CUB, SUN, and AWA2, demonstrate the effectiveness of the
proposed method.
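To make the two stages concrete, here is a minimal, hypothetical PyTorch sketch of the routing scheme the abstract describes. It is not the authors' implementation; all function names, tensor shapes, and the routing-update rule are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of HRT's two routing stages,
# assuming hypothetical shapes: B images, R image regions with D-dim
# features, A attributes, and C classes with known attribute vectors.
import torch

def encoder_active_attention(regions, attr_queries, n_iters=3):
    """Dynamic routing: attribute queries re-weight region features
    (top-down), and region/attribute agreement updates the routing
    logits (bottom-up) over a few iterations.
    regions:      (B, R, D) local visual features
    attr_queries: (A, D)    learnable attribute semantics
    returns:      (B, A, D) attribute-aligned visual features
    """
    D = regions.size(-1)
    # initialize routing from attribute-region similarity
    logits = torch.einsum('ad,brd->bar', attr_queries, regions) / D ** 0.5
    for _ in range(n_iters):
        weights = logits.softmax(dim=-1)   # (B, A, R) top-down attention
        aligned = weights @ regions        # (B, A, D) aligned features
        # bottom-up agreement refines the routing for the next pass
        logits = logits + torch.einsum('bad,brd->bar', aligned, regions) / D ** 0.5
    return aligned

def decoder_static_routing(aligned, attr_queries, class_attr):
    """Static (single-pass) routing: score each class by how well the
    attribute-aligned features agree with its class attribute vector.
    aligned:    (B, A, D) encoder output
    class_attr: (C, A)    per-class attribute annotations
    returns:    (B, C)    class logits
    """
    # per-attribute compatibility between visual features and semantics
    attr_scores = torch.einsum('bad,ad->ba', aligned, attr_queries)  # (B, A)
    return attr_scores @ class_attr.t()                              # (B, C)

# Toy usage with CUB-like sizes (312 attributes, 200 classes).
regions, attr_q = torch.randn(2, 49, 512), torch.randn(312, 512)
cls_attr = torch.randn(200, 312)
logits = decoder_static_routing(encoder_active_attention(regions, attr_q), attr_q, cls_attr)
print(logits.shape)  # torch.Size([2, 200])
```

In the ZSL setting, class_attr would hold the attribute vectors of the unseen classes at test time, so the prediction is the unseen class whose attributes best match the aligned visual evidence.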
Related papers
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- Dual Feature Augmentation Network for Generalized Zero-shot Learning [14.410978100610489]
Zero-shot learning (ZSL) aims to infer novel classes without training samples by transferring knowledge from seen classes.
Existing embedding-based approaches for ZSL typically employ attention mechanisms to locate attributes on an image.
We propose a novel Dual Feature Augmentation Network (DFAN), which comprises two feature augmentation modules.
arXiv Detail & Related papers (2023-09-25T02:37:52Z)
- Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling unmatched semantic-visual pairs to be recast as matched ones.
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
- Exploiting Semantic Attributes for Transductive Zero-Shot Learning [97.61371730534258]
Zero-shot learning aims to recognize unseen classes by generalizing the relation between visual features and semantic attributes learned from the seen classes.
We present a novel transductive ZSL method that produces semantic attributes of the unseen data and imposes them on the generative process.
Experiments on five standard benchmarks show that our method yields state-of-the-art results for zero-shot learning.
arXiv Detail & Related papers (2023-03-17T09:09:48Z)
- QuadFormer: Quadruple Transformer for Unsupervised Domain Adaptation in Power Line Segmentation of Aerial Images [12.840195641761323]
We propose a novel framework designed for domain adaptive semantic segmentation.
The hierarchical quadruple transformer combines cross-attention and self-attention mechanisms to adapt transferable context.
We present two datasets - ARPLSyn and ARPLReal - to further advance research in unsupervised domain adaptive powerline segmentation.
arXiv Detail & Related papers (2022-11-29T03:15:27Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [119.43299939907685]
Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones.
Existing attention-based models tend to learn inferior region features from a single image when relying solely on unidirectional attention.
We propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations (a generic sketch of attribute-guided cross-attention appears after this list).
arXiv Detail & Related papers (2021-12-16T05:49:51Z)
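As a rough illustration of the attribute-localization idea in the TransZero++ entry above, the following sketch shows generic attribute-guided cross-attention, with attribute embeddings as queries over region features. This is an assumed, simplified mechanism, not the paper's actual architecture; the class name and shapes are hypothetical.

```python
# Illustrative attribute-guided cross-attention (an assumed, simplified
# mechanism, not TransZero++'s actual architecture): attribute embeddings
# act as queries over region features to localize each attribute.
import torch
import torch.nn as nn

class AttributeCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects attribute embeddings
        self.k = nn.Linear(dim, dim)  # projects region features (keys)
        self.v = nn.Linear(dim, dim)  # projects region features (values)

    def forward(self, attrs, regions):
        # attrs: (A, D) attribute embeddings; regions: (B, R, D) features
        q = self.q(attrs)                                 # (A, D)
        k, v = self.k(regions), self.v(regions)           # (B, R, D)
        attn = torch.einsum('ad,brd->bar', q, k) / q.size(-1) ** 0.5
        attn = attn.softmax(dim=-1)    # where each attribute "looks"
        return attn @ v                # (B, A, D) attribute-localized features
```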