Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition
- URL: http://arxiv.org/abs/2504.09215v1
- Date: Sat, 12 Apr 2025 13:47:24 GMT
- Title: Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition
- Authors: Zhicheng Zhang, Hao Tang, Jinhui Tang,
- Abstract summary: Fine-Grained Bird Recognition (FGBR) has gained increasing attention.<n>Recent studies reveal that the limited receptive field of plain ViT model hinders representational richness.<n>We propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM)
- Score: 35.99227153038734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.
Related papers
- Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting [9.116108409344177]
The source-free cross-domain few-shot learning task aims to transfer pretrained models to target domains utilizing minimal samples.<n>We propose the SeGD-VPT framework, which is divided into two phases.<n>The first step aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varying input and enhancing sample diversity.
arXiv Detail & Related papers (2024-12-01T11:00:38Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - CoinSeg: Contrast Inter- and Intra- Class Representations for
Incremental Segmentation [85.13209973293229]
Class incremental semantic segmentation aims to strike a balance between the model's stability and plasticity.
We propose Contrast inter- and intra-class representations for Incremental (CoinSeg)
arXiv Detail & Related papers (2023-10-10T07:08:49Z) - Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts
in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs)
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - M2Former: Multi-Scale Patch Selection for Fine-Grained Visual
Recognition [4.621578854541836]
We propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models.
Specifically, MSPS selects salient patches of different scales at different stages of a vision Transformer (MS-ViT)
In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions.
arXiv Detail & Related papers (2023-08-04T06:41:35Z) - Hierarchical Forgery Classifier On Multi-modality Face Forgery Clues [61.37306431455152]
We propose a novel Hierarchical Forgery for Multi-modality Face Forgery Detection (HFC-MFFD)
The HFC-MFFD learns robust patches-based hybrid representation to enhance forgery authentication in multiple-modality scenarios.
The specific hierarchical face forgery is proposed to alleviate the class imbalance problem and further boost detection performance.
arXiv Detail & Related papers (2022-12-30T10:54:29Z) - Multi-scale and Cross-scale Contrastive Learning for Semantic
Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z) - Progressive Multi-stage Interactive Training in Mobile Network for
Fine-grained Recognition [8.727216421226814]
We propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI)
First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases.
Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages.
Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
arXiv Detail & Related papers (2021-12-08T10:50:03Z) - Disentangled Variational Autoencoder based Multi-Label Classification
with Covariance-Aware Multivariate Probit Model [10.004081409670516]
Multi-label classification is the challenging task of predicting the presence and absence of multiple targets.
We propose a novel framework for multi-label classification that effectively learns latent embedding spaces as well as label correlations.
arXiv Detail & Related papers (2020-07-12T23:08:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.