M2Former: Multi-Scale Patch Selection for Fine-Grained Visual
Recognition
- URL: http://arxiv.org/abs/2308.02161v1
- Date: Fri, 4 Aug 2023 06:41:35 GMT
- Title: M2Former: Multi-Scale Patch Selection for Fine-Grained Visual
Recognition
- Authors: Jiyong Moon, Junseok Lee, Yunju Lee, and Seongsik Park
- Abstract summary: We propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models.
Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT).
In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions.
- Score: 4.621578854541836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision Transformers (ViTs) have been actively applied to
fine-grained visual recognition (FGVR). ViT can effectively model the
interdependencies between patch-divided object regions through an inherent
self-attention mechanism. In addition, patch selection is used with ViT to
remove redundant patch information and highlight the most discriminative object
patches. However, existing ViT-based FGVR models are limited to single-scale
processing, and their fixed receptive fields hinder representational richness
and exacerbate vulnerability to scale variability. Therefore, we propose
multi-scale patch selection (MSPS) to improve the multi-scale capabilities of
existing ViT-based models. Specifically, MSPS selects salient patches of
different scales at different stages of a multi-scale vision Transformer
(MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale
cross-attention (MSCA) to model cross-scale interactions between selected
multi-scale patches and fully reflect them in model decisions. Compared to
previous single-scale patch selection (SSPS), our proposed MSPS encourages
richer object representations based on feature hierarchy and consistently
improves performance from small-sized to large-sized objects. As a result, we
propose M2Former, which outperforms CNN-/ViT-based models on several widely
used FGVR benchmarks.
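
The abstract names three components: MSPS (select salient patches at several stages of a multi-scale ViT), CTT (carry class-token information across scales), and MSCA (cross-attention among the selected multi-scale patches). The exact selection criterion and fusion details are not given here, so the following is a minimal PyTorch sketch under stated assumptions: saliency is approximated by similarity between patch tokens and the stage's class token, stage widths and the top-k value are hypothetical, and the cross-attention stand-in is a single shared attention layer. It illustrates the general per-stage top-k selection followed by cross-scale interaction, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiScalePatchSelectionSketch(nn.Module):
    """Sketch of MSPS-style selection: at each stage of a multi-scale ViT,
    keep the k patch tokens most similar to that stage's class token, then
    let the selected tokens from all stages interact via cross-attention.
    Scoring and fusion here are assumptions, not the paper's exact method."""

    def __init__(self, dims=(96, 192, 384, 768), common_dim=256, k=32, num_heads=4):
        super().__init__()
        # Project every stage to a common width so tokens from different
        # scales can attend to one another.
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in dims])
        self.k = k
        # Stand-in for multi-scale cross-attention (MSCA): one shared layer
        # over the concatenated selected tokens of all stages.
        self.cross_attn = nn.MultiheadAttention(common_dim, num_heads, batch_first=True)

    def select(self, tokens, cls_token):
        # tokens: (B, N, D), cls_token: (B, D)
        # Hypothetical saliency: dot product between patch and class tokens.
        scores = torch.einsum("bnd,bd->bn", tokens, cls_token)
        idx = scores.topk(self.k, dim=1).indices              # (B, k)
        batch = torch.arange(tokens.size(0)).unsqueeze(1)
        return tokens[batch, idx]                             # (B, k, D)

    def forward(self, stage_tokens, stage_cls):
        # stage_tokens: list of (B, N_s, D_s); stage_cls: list of (B, D_s)
        selected = []
        for tok, cls, proj in zip(stage_tokens, stage_cls, self.proj):
            sel = self.select(tok, cls)                       # per-stage top-k patches
            selected.append(proj(sel))                        # map to common width
        fused = torch.cat(selected, dim=1)                    # (B, S*k, common_dim)
        fused, _ = self.cross_attn(fused, fused, fused)       # cross-scale interaction
        return fused.mean(dim=1)                              # pooled representation


if __name__ == "__main__":
    B, dims = 2, (96, 192, 384, 768)
    tokens = [torch.randn(B, n, d) for n, d in zip((3136, 784, 196, 49), dims)]
    cls = [torch.randn(B, d) for d in dims]
    out = MultiScalePatchSelectionSketch(dims)(tokens, cls)
    print(out.shape)  # torch.Size([2, 256])
```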