Implicit and Explicit Attention for Zero-Shot Learning
- URL: http://arxiv.org/abs/2110.00860v1
- Date: Sat, 2 Oct 2021 18:06:21 GMT
- Title: Implicit and Explicit Attention for Zero-Shot Learning
- Authors: Faisal Alamri and Anjan Dutta
- Abstract summary: We propose implicit and explicit attention mechanisms to address the bias problem in Zero-Shot Learning (ZSL) models.
We conduct comprehensive experiments on three popular benchmarks: AWA2, CUB and SUN.
- Score: 11.66422653137002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the existing Zero-Shot Learning (ZSL) methods focus on learning a
compatibility function between the image representation and class attributes.
Few others concentrate on learning image representation combining local and
global features. However, the existing approaches still fail to address the
bias issue towards the seen classes. In this paper, we propose implicit and
explicit attention mechanisms to address the existing bias problem in ZSL
models. We formulate the implicit attention mechanism with a self-supervised
image rotation angle prediction task, which drives the model to attend to the
image features needed to solve that task. The explicit attention mechanism is
realized through the multi-head self-attention of a Vision Transformer, which
learns to map image features to the semantic space during training. We conduct
comprehensive experiments on three popular benchmarks: AWA2, CUB and SUN. The
proposed attention mechanisms prove effective, achieving the state-of-the-art
harmonic mean on all three datasets.
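For illustration, here is a minimal PyTorch sketch of the two ideas as described in the abstract: an auxiliary rotation-prediction head supplies the implicit attention signal, while a small multi-head self-attention encoder maps image features to the attribute space. Names such as `PatchAttentionEncoder`, the loss weight `lam`, and all layer sizes are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a patch-based self-attention
# backbone maps images to the attribute (semantic) space, while an auxiliary
# head predicts the rotation applied to the input (0/90/180/270 degrees).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttentionEncoder(nn.Module):
    """Splits an image into patches and applies multi-head self-attention."""
    def __init__(self, patch=16, dim=256, heads=8, depth=2):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        return self.encoder(tokens).mean(dim=1)            # pooled image feature

class ZSLModel(nn.Module):
    def __init__(self, dim=256, attr_dim=85, n_rotations=4):
        super().__init__()
        self.backbone = PatchAttentionEncoder(dim=dim)
        self.to_attributes = nn.Linear(dim, attr_dim)       # explicit: feature -> semantic space
        self.rotation_head = nn.Linear(dim, n_rotations)    # implicit: self-supervised rotation task

    def forward(self, x):
        f = self.backbone(x)
        return self.to_attributes(f), self.rotation_head(f)

def training_step(model, images, class_attributes, labels, lam=0.5):
    """Combine the attribute-compatibility loss with the rotation pretext loss."""
    B = images.size(0)
    k = torch.randint(0, 4, (B,))                           # random rotation per (square) image
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(images, k)])
    attr_pred, rot_pred = model(rotated)
    logits = attr_pred @ class_attributes.t()               # compatibility with each class prototype
    loss_cls = F.cross_entropy(logits, labels)
    loss_rot = F.cross_entropy(rot_pred, k)
    return loss_cls + lam * loss_rot
```

At inference, class scores would be obtained by comparing the predicted attribute vector against the attribute prototypes of seen and unseen classes.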
Related papers
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates comparable performance with current SOTA methods across various datasets.
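A hedged sketch of the pairwise scoring idea summarized above, assuming precomputed, L2-normalised CLIP-style embeddings; the prompt handling, temperature and fusion weight `alpha` are illustrative assumptions, not the paper's configuration.

```python
# Each image is scored against text prompts AND against its paired image,
# which serves as a visual reference.
import torch
import torch.nn.functional as F

def dual_image_anomaly_scores(img_a, img_b, txt_normal, txt_anomalous, alpha=0.5):
    """img_a, img_b: (D,) image embeddings; txt_*: (D,) text embeddings (all L2-normalised)."""
    def text_score(img):
        # Higher means "more anomalous" relative to the normal prompt.
        sims = torch.stack([img @ txt_normal, img @ txt_anomalous])
        return F.softmax(sims / 0.07, dim=0)[1]      # temperature 0.07, softmax over the two prompts
    visual_gap = 1.0 - (img_a @ img_b)               # low similarity to the reference also signals anomaly
    score_a = alpha * text_score(img_a) + (1 - alpha) * visual_gap
    score_b = alpha * text_score(img_b) + (1 - alpha) * visual_gap
    return score_a, score_b
```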
arXiv Detail & Related papers (2024-05-08T03:13:20Z) - Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
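As a rough illustration of the multi-level feature fusion mentioned above (not the exact DAB design), the sketch below projects several backbone stages to a shared width and sums them into one enriched visual map; all sizes are assumed, and the semantic-interaction part is omitted.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    def __init__(self, in_dims=(512, 1024, 2048), dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(d, dim, kernel_size=1) for d in in_dims])

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from different backbone stages
        target = feats[-1].shape[-2:]
        fused = sum(
            nn.functional.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        )
        return fused  # (B, dim, H, W): fused multi-level visual features
```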
arXiv Detail & Related papers (2024-05-06T16:31:19Z) - Zero-Shot Learning by Harnessing Adversarial Samples [52.09717785644816]
We propose a novel Zero-Shot Learning (ZSL) approach by Harnessing Adversarial Samples (HAS).
HAS advances ZSL through adversarial training which takes into account three crucial aspects.
We demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios.
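The adversarial training itself is not detailed in the summary; as a generic illustration, the sketch below generates FGSM-style adversarial samples that could be mixed into training. The loss, epsilon and mixing strategy are assumptions, not the HAS recipe.

```python
import torch

def fgsm_samples(model, images, labels, loss_fn, eps=2/255):
    """Create adversarial copies of a batch by moving pixels along the loss gradient sign."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), labels)
    loss.backward()
    adv = images + eps * images.grad.sign()
    return adv.clamp(0, 1).detach()   # keep pixels in a valid range
```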
arXiv Detail & Related papers (2023-08-01T06:19:13Z) - SACANet: scene-aware class attention network for semantic segmentation of remote sensing images [4.124381172041927]
We propose a scene-aware class attention network (SACANet) for semantic segmentation of remote sensing images.
Experimental results on three datasets show that SACANet outperforms other state-of-the-art methods and validate its effectiveness.
arXiv Detail & Related papers (2023-04-22T14:54:31Z) - DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET often achieves state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z) - Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification [19.957957963417414]
We propose a dual cross-attention learning (DCAL) algorithm to coordinate with self-attention learning.
First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions.
Second, we propose pair-wise cross-attention (PWCA) to establish the interactions between image pairs.
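Below is a minimal cross-attention sketch of the generic operation behind GLCA/PWCA as summarized above: queries from one token set (e.g. local high-response regions, or one image of a pair) attend to another token set. Dimensions are assumptions.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

def cross_attend(queries, context):
    # queries: (B, Nq, 256)  e.g. local high-response tokens
    # context: (B, Nk, 256)  e.g. global tokens of the same or a paired image
    out, weights = cross_attn(queries, context, context)
    return out, weights   # attended features and attention maps
```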
arXiv Detail & Related papers (2022-05-04T16:14:26Z) - UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet, and outperforms BYOL by 2.5% in linear probing with the same number of pre-training epochs.
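For reference, linear probing as mentioned above can be sketched as freezing the pre-trained backbone and training only a linear classifier on its features; `backbone` and `feat_dim` are placeholders.

```python
import torch.nn as nn

def linear_probe_head(backbone, feat_dim, n_classes):
    for p in backbone.parameters():
        p.requires_grad = False          # evaluate the frozen representation only
    return nn.Linear(feat_dim, n_classes)  # the only trainable part during probing
```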
arXiv Detail & Related papers (2022-03-14T10:04:04Z) - Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
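A hedged sketch of the idea above: a per-head regulariser that pulls the empirical query and key distributions together. Moment matching is used here as a simple surrogate; the paper's actual distribution-matching objective may differ.

```python
import torch

def qk_alignment_penalty(q, k):
    # q, k: (B, heads, tokens, head_dim) query and key tensors of one attention layer
    mean_gap = (q.mean(dim=2) - k.mean(dim=2)).pow(2).mean()
    var_gap = (q.var(dim=2) - k.var(dim=2)).pow(2).mean()
    return mean_gap + var_gap   # added to the task loss with a small weight
```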
arXiv Detail & Related papers (2021-10-25T00:54:57Z) - Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning [11.66422653137002]
We propose an attention-based model in the Zero-Shot Learning setting to learn attributes useful for unseen-class recognition.
Our method uses an attention mechanism adapted from Vision Transformer to capture and learn discriminative attributes by splitting images into small patches.
arXiv Detail & Related papers (2021-07-30T19:08:44Z) - All the attention you need: Global-local, spatial-channel attention for image retrieval [11.150896867058902]
We address representation learning for large-scale instance-level image retrieval.
We present global-local attention module (GLAM), which is attached at the end of a backbone network.
We obtain a new feature tensor and, by spatial pooling, we learn a powerful embedding for image retrieval.
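A rough sketch of spatial-channel attention over a backbone feature map followed by spatial pooling into a retrieval embedding; layer sizes and the pooling choice are assumptions, not the exact GLAM module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(                          # channel attention (squeeze-excite style)
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)   # per-location attention logits

    def forward(self, x):                                      # x: (B, C, H, W) backbone output
        c = self.channel(x.mean(dim=(2, 3)))[:, :, None, None]
        s = torch.sigmoid(self.spatial(x))
        attended = x * c * s
        emb = attended.mean(dim=(2, 3))                        # spatial pooling -> (B, C) embedding
        return F.normalize(emb, dim=1)                         # L2-normalised for retrieval
```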
arXiv Detail & Related papers (2021-07-16T16:39:13Z) - Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability.
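A compact co-attention sketch of the cross-image idea above: an affinity matrix between the spatial features of two related images lets each aggregate semantically related context from the other. Shapes and normalisation are assumptions.

```python
import torch
import torch.nn.functional as F

def co_attention(fa, fb):
    # fa, fb: (B, C, H, W) feature maps of two related images
    B, C, H, W = fa.shape
    a = fa.flatten(2)                                                    # (B, C, HW)
    b = fb.flatten(2)
    affinity = torch.einsum("bci,bcj->bij", a, b) / C**0.5               # (B, HW_a, HW_b)
    a_ctx = torch.einsum("bij,bcj->bci", F.softmax(affinity, dim=2), b)  # context for image A from B
    b_ctx = torch.einsum("bij,bci->bcj", F.softmax(affinity, dim=1), a)  # context for image B from A
    return a_ctx.view(B, C, H, W), b_ctx.view(B, C, H, W)
```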
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.