SIM-Trans: Structure Information Modeling Transformer for Fine-grained
Visual Categorization
- URL: http://arxiv.org/abs/2208.14607v1
- Date: Wed, 31 Aug 2022 03:00:07 GMT
- Title: SIM-Trans: Structure Information Modeling Transformer for Fine-grained
Visual Categorization
- Authors: Hongbo Sun, Xiangteng He, Yuxin Peng
- Abstract summary: We propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer to enhance discriminative representation learning.
The two proposed modules are lightweight and can easily be plugged into any transformer network and trained end-to-end.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
- Score: 59.732036564862796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained visual categorization (FGVC) aims to recognize objects from
similar subordinate categories, which is challenging yet of great practical
value for accurate automatic recognition. Most FGVC approaches focus on
attention mechanisms for mining discriminative regions while neglecting the
interdependencies among those regions and the holistic object structure they
compose, both of which are essential for localizing and understanding
discriminative information. To address these limitations, we propose the
Structure Information Modeling Transformer (SIM-Trans), which incorporates
object structure information into the transformer so that the learned
representation captures both appearance and structure information.
Specifically, we encode the image into a sequence of patch tokens and build a
strong vision transformer framework with two well-designed modules: (i) the
structure information learning (SIL) module mines the spatial context
relations among significant patches within the object extent using the
transformer's self-attention weights, and injects these relations back into
the model to import structure information; (ii) the multi-level feature
boosting (MFB) module exploits the complementarity of multi-level features and
applies contrastive learning among classes to enhance feature robustness for
accurate recognition. The two proposed modules are lightweight, depend only on
the attention weights that the vision transformer already computes, and can be
plugged into any transformer network and trained end-to-end. Extensive
experiments and analyses demonstrate that the proposed
SIM-Trans achieves state-of-the-art performance on fine-grained visual
categorization benchmarks. The code is available at
https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022.
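To make the SIL idea concrete, here is a minimal PyTorch sketch of one plausible reading of it: the CLS token's attention weights pick out the most significant patches, and each selected patch's position relative to the most salient one is encoded as a simple polar-coordinate structure feature. The function names, the choice of k, and the polar encoding are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of attention-guided structure mining (hypothetical names).
import torch

def select_salient_patches(attn: torch.Tensor, k: int = 8) -> torch.Tensor:
    """attn: (B, heads, 1 + N, 1 + N) self-attention weights from the last layer.
    Returns indices (B, k) of the k patches the CLS token attends to most."""
    cls_to_patch = attn.mean(dim=1)[:, 0, 1:]       # (B, N): average heads, drop the CLS column
    return cls_to_patch.topk(k, dim=-1).indices     # (B, k), sorted by attention weight

def polar_relations(idx: torch.Tensor, grid: int) -> torch.Tensor:
    """Encode each selected patch relative to the most salient one as
    (distance, angle) on the patch grid -- one reading of the 'spatial
    context relation' that SIL mines."""
    row = torch.div(idx, grid, rounding_mode="floor").float()   # (B, k) grid rows
    col = (idx % grid).float()                                  # (B, k) grid columns
    drow, dcol = row - row[:, :1], col - col[:, :1]             # offsets from the top-1 patch
    dist = torch.sqrt(drow ** 2 + dcol ** 2)
    angle = torch.atan2(drow, dcol)
    return torch.stack([dist, angle], dim=-1)                   # (B, k, 2) structure features

# Usage with a 14x14 patch grid (196 patch tokens + CLS):
B, heads, N = 2, 12, 196
attn = torch.softmax(torch.randn(B, heads, N + 1, N + 1), dim=-1)
struct = polar_relations(select_salient_patches(attn), grid=14)
print(struct.shape)  # torch.Size([2, 8, 2]); could be embedded and added to the patch tokens
```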
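In the same spirit, a hedged sketch of the MFB idea: class tokens from several layers are fused to exploit their complementarity, and a standard supervised contrastive term pulls same-class features together while pushing different classes apart. Concatenation as the fusion, the layer choice, and the exact loss form are assumptions for illustration, not the paper's formulation.

```python
# Hedged sketch of multi-level fusion plus class-contrastive regularization.
import torch
import torch.nn.functional as F

def fuse_multilevel(cls_tokens: list[torch.Tensor]) -> torch.Tensor:
    """cls_tokens: class tokens (each (B, D)) from several transformer layers.
    Concatenation is one simple way to exploit their complementarity."""
    return torch.cat(cls_tokens, dim=-1)            # (B, D * num_layers)

def supcon_loss(feat: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss: same-class pairs in the batch are positives."""
    z = F.normalize(feat, dim=-1)
    sim = z @ z.t() / tau                           # (B, B) scaled cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    pos_mask = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    # log-softmax over all non-self pairs in the batch
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    return -(log_prob * pos_mask).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()

# Usage: fuse the class tokens of the last three layers, then regularize.
feats = fuse_multilevel([torch.randn(8, 768) for _ in range(3)])
loss = supcon_loss(feats, labels=torch.randint(0, 4, (8,)))
print(loss.item())
```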
Related papers
- Brain-Inspired Stepwise Patch Merging for Vision Transformers [6.108377966393714]
We propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better.
Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models.
arXiv Detail & Related papers (2024-09-11T03:04:46Z) - Improved EATFormer: A Vision Transformer for Medical Image Classification [0.0]
This paper presents an improved Evolutionary Algorithm-based Transformer (EATFormer) architecture for medical image classification using Vision Transformers.
The proposed EATFormer architecture combines the strengths of Convolutional Neural Networks and Vision Transformers.
Experimental results on the Chest X-ray and Kvasir datasets demonstrate that the proposed EATFormer significantly improves prediction speed and accuracy compared to baseline models.
arXiv Detail & Related papers (2024-03-19T21:40:20Z) - Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z) - Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z) - Cross-receptive Focused Inference Network for Lightweight Image
Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, the ability of Transformers to incorporate contextual information for dynamic feature extraction is often neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixing CNN and Transformer components.
arXiv Detail & Related papers (2022-07-06T16:32:29Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z) - Point Cloud Learning with Transformer [2.3204178451683264]
We introduce a novel framework called the Multi-level Multi-scale Point Transformer (MLMSPT).
Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales.
A multi-level transformer module is designed to aggregate contextual information from different levels of each scale and enhance their interactions.
arXiv Detail & Related papers (2021-04-28T08:39:21Z)