Dual Cross-Attention Learning for Fine-Grained Visual Categorization and
Object Re-Identification
- URL: http://arxiv.org/abs/2205.02151v1
- Date: Wed, 4 May 2022 16:14:26 GMT
- Title: Dual Cross-Attention Learning for Fine-Grained Visual Categorization and
Object Re-Identification
- Authors: Haowei Zhu, Wenjing Ke, Dong Li, Ji Liu, Lu Tian, Yi Shan
- Abstract summary: We propose a dual cross-attention learning (DCAL) algorithm to coordinate with self-attention learning.
First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions.
Second, we propose pair-wise cross-attention (PWCA) to establish the interactions between image pairs.
- Score: 19.957957963417414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, self-attention mechanisms have shown impressive performance in
various NLP and CV tasks, helping models capture sequential characteristics and
derive global information. In this work, we explore how to extend
self-attention modules to better learn subtle feature embeddings for
recognizing fine-grained objects, e.g., different bird species or person
identities. To this end, we propose a dual cross-attention learning (DCAL)
algorithm to coordinate with self-attention learning. First, we propose
global-local cross-attention (GLCA) to enhance the interactions between global
images and local high-response regions, which can help reinforce the
spatial-wise discriminative clues for recognition. Second, we propose pair-wise
cross-attention (PWCA) to establish the interactions between image pairs. PWCA
can regularize the attention learning of an image by treating another image as
a distractor; this branch is used only during training and is removed at
inference. We observe that DCAL can reduce misleading attention and diffuse
the attention responses to discover
more complementary parts for recognition. We conduct extensive evaluations on
fine-grained visual categorization and object re-identification. Experiments
demonstrate that DCAL performs on par with state-of-the-art methods and
consistently improves multiple self-attention baselines, e.g., surpassing
DeiT-Tiny and ViT-Base by 2.8% and 2.4% mAP on MSMT17, respectively.
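To make the two mechanisms concrete, below is a minimal PyTorch-style sketch of how GLCA and PWCA could be wired on top of a generic cross-attention module. The module structure, the top-R token selection rule, and all names and shapes are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Hedged sketch of GLCA and PWCA (illustrative; not the official DCAL code).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Multi-head attention where queries may come from a different token set
    than keys/values (self-attention is the case q_tokens == kv_tokens)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, q_tokens, kv_tokens):
        B, Nq, D = q_tokens.shape
        Nk = kv_tokens.shape[1]
        q = self.q(q_tokens).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(kv_tokens).view(B, Nk, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)            # each: (B, heads, Nk, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, Nq, D)
        return self.proj(out)

def glca(cross_attn, tokens, response, top_r=0.1):
    """Global-local cross-attention: the top-R highest-response tokens (e.g.
    ranked by accumulated CLS attention) act as local queries attending over
    the full global token set."""
    B, N, D = tokens.shape
    r = max(1, int(top_r * N))
    idx = response.topk(r, dim=1).indices           # (B, r) token indices
    local = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
    return cross_attn(local, tokens)                # (B, r, D)

def pwca(cross_attn, tokens_a, tokens_b):
    """Pair-wise cross-attention: queries from image A, keys/values from the
    concatenated pair (A, B), so B acts as a distractor that regularizes the
    attention of A. Training-only; this branch is dropped at inference."""
    return cross_attn(tokens_a, torch.cat([tokens_a, tokens_b], dim=1))
```

Since PWCA only changes the key/value set during training, it adds no cost at inference if its weights are shared with the self-attention branch, which is consistent with the abstract's note that PWCA is removed at test time.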
Related papers
- Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! [3.355491272942994]
This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics.
We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing.
arXiv Detail & Related papers (2024-10-28T12:43:48Z)
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Your "Attention" Deserves Attention: A Self-Diversified Multi-Channel Attention for Facial Action Analysis [12.544285462327839]
We propose a compact model to enhance the representational and focusing power of neural attention maps.
The proposed method is evaluated on two benchmark databases (BP4D and DISFA) for AU detection and four databases (CK+, MMI, BU-3DFE, and BP4D+) for facial expression recognition.
It achieves superior performance compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-03-23T17:29:51Z)
- Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification [101.49122450005869]
We present a counterfactual attention learning method to learn more effective attention based on causal inference.
Specifically, we analyze the effect of the learned visual attention on network prediction.
We evaluate our method on a wide range of fine-grained recognition tasks.
arXiv Detail & Related papers (2021-08-19T14:53:40Z)
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs (a minimal sketch of this mechanism follows this list).
arXiv Detail & Related papers (2021-05-05T22:29:52Z)
- Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects attention differences among the multiple views and adaptively integrates frame-level information so that the views benefit from each other.
Experiments on four action datasets show that the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-art results in all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
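As referenced in the "Beyond Self-attention" entry above, below is a minimal sketch of the external-attention idea: attention is computed against two small, learnable, shared memories implemented as linear layers, rather than against the input's own keys and values. The memory size and the double-normalization step are assumptions based on the entry's description, not that paper's official code.

```python
# Hedged sketch of external attention via two linear layers (illustrative).
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Attention over two small external memories shared across all samples,
    instead of the input's own keys/values."""
    def __init__(self, dim, mem_size=64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)  # external "key" memory
        self.mv = nn.Linear(mem_size, dim, bias=False)  # external "value" memory

    def forward(self, x):                               # x: (B, N, dim)
        attn = self.mk(x)                               # (B, N, mem_size)
        attn = attn.softmax(dim=1)                      # normalize over tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # second normalization
        return self.mv(attn)                            # (B, N, dim)
```

Because the memories have a fixed small size, the cost is linear in the number of tokens rather than quadratic as in standard self-attention, which is the source of the lower computational and memory costs the entry claims.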