M2Former: Multi-Scale Patch Selection for Fine-Grained Visual
Recognition
- URL: http://arxiv.org/abs/2308.02161v1
- Date: Fri, 4 Aug 2023 06:41:35 GMT
- Title: M2Former: Multi-Scale Patch Selection for Fine-Grained Visual
Recognition
- Authors: Jiyong Moon, Junseok Lee, Yunju Lee, and Seongsik Park
- Abstract summary: We propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models.
Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT).
In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions.
- Score: 4.621578854541836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision Transformers (ViTs) have been actively applied to
fine-grained visual recognition (FGVR). ViT can effectively model the
interdependencies between patch-divided object regions through an inherent
self-attention mechanism. In addition, patch selection is used with ViT to
remove redundant patch information and highlight the most discriminative object
patches. However, existing ViT-based FGVR models are limited to single-scale
processing, and their fixed receptive fields hinder representational richness
and exacerbate vulnerability to scale variability. Therefore, we propose
multi-scale patch selection (MSPS) to improve the multi-scale capabilities of
existing ViT-based models. Specifically, MSPS selects salient patches of
different scales at different stages of a multi-scale vision Transformer
(MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale
cross-attention (MSCA) to model cross-scale interactions between selected
multi-scale patches and fully reflect them in model decisions. Compared to
previous single-scale patch selection (SSPS), our proposed MSPS encourages
richer object representations based on feature hierarchy and consistently
improves performance from small-sized to large-sized objects. As a result, we
propose M2Former, which outperforms CNN-/ViT-based models on several widely
used FGVR benchmarks.
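The MSPS idea above lends itself to a short illustration. Below is a minimal sketch, assuming a hierarchical backbone that exposes one token map per stage: tokens at each stage are scored and only the top-k are kept, with a different keep ratio per scale. The feature-norm scoring rule and the keep ratios are illustrative assumptions, not the paper's exact selection criterion, and CTT/MSCA are not reproduced here.

```python
# Minimal sketch of multi-scale patch (token) selection, assuming a hierarchical
# ViT that exposes one (B, N, C) token map per stage. Scoring by L2 feature norm
# and the per-stage keep ratios are illustrative assumptions.
import torch


def select_salient_tokens(stage_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k most salient tokens from one stage (stage_tokens: B x N x C)."""
    scores = stage_tokens.norm(dim=-1)                        # (B, N) saliency proxy
    k = max(1, int(stage_tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                   # (B, k)
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, stage_tokens.shape[-1])
    return stage_tokens.gather(1, top_idx)                    # (B, k, C)


def multi_scale_patch_selection(stages, keep_ratios=(0.5, 0.3, 0.1)):
    """stages: list of (B, N_s, C_s) token maps, ordered fine to coarse."""
    return [select_salient_tokens(t, r) for t, r in zip(stages, keep_ratios)]


if __name__ == "__main__":
    B = 2
    stages = [torch.randn(B, 196, 96), torch.randn(B, 49, 192), torch.randn(B, 16, 384)]
    selected = multi_scale_patch_selection(stages)
    print([tuple(s.shape) for s in selected])                 # one (B, k_s, C_s) set per scale
```

In a full model, the selected tokens from each stage would then interact (e.g., through the paper's class token transfer and multi-scale cross-attention) before contributing to the final prediction.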
Related papers
- MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation [3.64388407705261]
We propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet.
Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder.
arXiv Detail & Related papers (2024-08-25T06:20:28Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification [15.780372479483235]
PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models.
We propose the Meta Visual Prompt Tuning (MVP) method, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen (see the sketch below).
We introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes.
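As a rough, hedged sketch of the freeze-and-prompt idea described above: learnable prompt tokens are prepended to the patch tokens, and only the prompts and the classification head receive gradients while the backbone stays frozen. The backbone interface, the number of prompts, and mean pooling are illustrative assumptions, not MVP's exact meta-training setup or its patch-embedding-recombination augmentation.

```python
# Hedged sketch of prompt tuning with a frozen backbone: learnable prompt tokens
# are prepended to the patch tokens, and only the prompts and the classification
# head are trained. The backbone interface and hyperparameters are assumptions.
import torch
import torch.nn as nn


class PromptTunedClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                            # keep the backbone frozen
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.head = nn.Linear(embed_dim, num_classes)          # trainable, like the prompts

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        """patch_tokens: (B, N, D) embeddings from a (frozen) patch-embedding layer."""
        B = patch_tokens.shape[0]
        x = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        x = self.backbone(x)                                   # frozen transformer blocks
        return self.head(x.mean(dim=1))                        # mean-pooled representation


if __name__ == "__main__":
    # Stand-in backbone: a single transformer encoder layer.
    layer = nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True)
    model = PromptTunedClassifier(layer, embed_dim=192, num_classes=10)
    logits = model(torch.randn(2, 196, 192))
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(logits.shape, trainable)                             # only 'prompts' and 'head.*'
```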
arXiv Detail & Related papers (2023-09-17T13:51:05Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Optimizing Relevance Maps of Vision Transformers Improves Robustness [91.61353418331244]
It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes.
We propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object.
This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks.
arXiv Detail & Related papers (2022-06-02T17:24:48Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- Progressive Multi-stage Interactive Training in Mobile Network for Fine-grained Recognition [8.727216421226814]
We propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI).
First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases.
Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages.
Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
arXiv Detail & Related papers (2021-12-08T10:50:03Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer (a minimal sketch of this idea appears after this entry).
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
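As a closing illustration of the hybrid-scale attention idea in the Shunted Self-Attention entry above, the sketch below lets one group of heads attend to full-resolution keys/values while another group attends to keys/values pooled on the 2D token grid. The average-pooling operator, the even head split, and the module interface are assumptions for illustration, not the paper's exact multi-scale token aggregation.

```python
# Hedged sketch of hybrid-scale self-attention: half of the heads attend to
# full-resolution keys/values, the other half to keys/values average-pooled on
# the 2D token grid. Pooling rate, head split, and interface are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridScaleAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, coarse_rate: int = 2):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % 2 == 0
        self.num_heads = num_heads
        self.coarse_rate = coarse_rate
        self.q = nn.Linear(dim, dim)
        self.kv_fine = nn.Linear(dim, dim)                     # K and V for the fine heads
        self.kv_coarse = nn.Linear(dim, dim)                   # K and V for the coarse heads
        self.proj = nn.Linear(dim, dim)

    @staticmethod
    def _attend(q, kv_src, kv_proj, heads):
        B, Nq, Cq = q.shape
        d = Cq // heads
        k, v = kv_proj(kv_src).chunk(2, dim=-1)                # each (B, Nk, Cq)
        q = q.view(B, Nq, heads, d).transpose(1, 2)
        k = k.view(B, -1, heads, d).transpose(1, 2)
        v = v.view(B, -1, heads, d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(d)
        out = attn.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(B, Nq, Cq)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        """x: (B, N, C) tokens laid out on an H x W grid with N == H * W."""
        B, N, C = x.shape
        H, W = hw
        assert N == H * W
        q_fine, q_coarse = self.q(x).chunk(2, dim=-1)          # split channels across head groups
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        coarse = F.avg_pool2d(grid, self.coarse_rate).flatten(2).transpose(1, 2)
        out_fine = self._attend(q_fine, x, self.kv_fine, self.num_heads // 2)
        out_coarse = self._attend(q_coarse, coarse, self.kv_coarse, self.num_heads // 2)
        return self.proj(torch.cat([out_fine, out_coarse], dim=-1))


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 64)                       # 14 x 14 token grid, 64 channels
    attn = HybridScaleAttention(dim=64, num_heads=4, coarse_rate=2)
    print(attn(tokens, hw=(14, 14)).shape)                     # torch.Size([2, 196, 64])
```

Attending to pooled keys/values in some heads trades a little spatial detail for a cheaper, larger effective receptive field, which is the general motivation shared by the multi-scale designs listed above.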