ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image
Identification
- URL: http://arxiv.org/abs/2312.16914v1
- Date: Thu, 28 Dec 2023 09:16:27 GMT
- Title: ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image
Identification
- Authors: Ga-Eun Kim, Chang-Hwan Son
- Abstract summary: We propose a novel ROI-aware multiscale cross-attention vision transformer (ROI-ViT)
The proposed ROI-ViT is designed using dual branches, called Pest and ROI branches, which take different types of maps as input: Pest images and ROI maps.
The experimental results show that the proposed ROI-ViT achieves recognition accuracies of 81.81%, 99.64%, and 84.66% on the IP102, D0, and SauTeg pest datasets, respectively.
- Score: 1.9580473532948401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pests captured with imaging devices may be relatively small in size
compared to the entire images, and complex backgrounds have colors and textures
similar to those of the pests, which hinders accurate feature extraction and
makes pest identification challenging. The key to pest identification is to
create a model capable of detecting regions of interest (ROIs) and transforming
them into better ones for attention and discriminative learning. To address
these problems, we study how to generate and update ROIs via multiscale
cross-attention fusion and how to make the model highly robust to complex
backgrounds and scale variation. Therefore, we propose a novel ROI-aware
multiscale cross-attention vision transformer (ROI-ViT). The proposed ROI-ViT
is designed using dual branches, called Pest and ROI branches, which take
different types of maps as input: Pest images and ROI maps. To render such ROI
maps, ROI generators are built using soft segmentation and a class activation
map and then integrated into the ROI-ViT backbone. Additionally, in the dual
branch, complementary feature fusion and multiscale hierarchies are implemented
via a novel multiscale cross-attention fusion. The class token from the Pest
branch is exchanged with the patch tokens from the ROI branch, and vice versa.
The experimental results show that the proposed ROI-ViT achieves recognition
accuracies of 81.81%, 99.64%, and 84.66% on the IP102, D0, and SauTeg pest datasets, respectively,
outperforming state-of-the-art (SOTA) models, such as MViT, PVT, DeiT,
Swin-ViT, and EfficientNet. More importantly, for the new challenging dataset
IP102(CBSS) that contains only pest images with complex backgrounds and small
sizes, the proposed model maintains high recognition accuracy, whereas the
accuracy of the other SOTA models decreases sharply, demonstrating that our model
is more robust to complex backgrounds and scale problems.
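To make the two mechanisms described in the abstract concrete, below is a minimal PyTorch sketch of how an ROI map could be derived from a class activation map (CAM) followed by soft segmentation. The ResNet-18 backbone, the min-max normalization used as the soft-segmentation step, and all function names are assumptions for illustration only; this is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F
import torchvision

@torch.no_grad()
def roi_map_from_cam(images: torch.Tensor, num_classes: int = 102) -> torch.Tensor:
    """images: (B, 3, H, W) pest images -> (B, 1, H, W) soft ROI maps in [0, 1]."""
    # Hypothetical classifier backbone; ROI-ViT's actual ROI generator may differ.
    backbone = torchvision.models.resnet18(weights=None)
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, num_classes)
    backbone.eval()

    # Convolutional feature maps before global average pooling: (B, C, h, w).
    features = torch.nn.Sequential(*list(backbone.children())[:-2])(images)
    logits = backbone.fc(features.mean(dim=(2, 3)))        # (B, num_classes)
    top_cls = logits.argmax(dim=1)                         # predicted class per image

    # CAM: weight the feature maps by the classifier weights of the top class.
    w = backbone.fc.weight[top_cls]                        # (B, C)
    cam = torch.einsum("bc,bchw->bhw", w, features)        # (B, h, w)

    # Soft segmentation: min-max normalize to [0, 1] instead of hard thresholding.
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)
    return F.interpolate(cam.unsqueeze(1), size=images.shape[-2:],
                         mode="bilinear", align_corners=False)
```

Likewise, a minimal sketch of the cross-attention token exchange described in the abstract, in which the class token of each branch attends to the patch tokens of the other branch. The embedding size, head count, and residual update are assumptions; the actual ROI-ViT fusion layers and their multiscale arrangement may differ.

```python
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    """Exchange class tokens between the Pest and ROI branches via cross-attention."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn_pest = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_roi = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pest_tokens: torch.Tensor, roi_tokens: torch.Tensor):
        """Both inputs are (B, 1 + N, dim): a class token followed by patch tokens."""
        pest_cls, pest_patches = pest_tokens[:, :1], pest_tokens[:, 1:]
        roi_cls, roi_patches = roi_tokens[:, :1], roi_tokens[:, 1:]

        # Pest-branch class token queries the ROI-branch patch tokens ...
        fused_pest_cls, _ = self.attn_pest(pest_cls, roi_patches, roi_patches)
        # ... and the ROI-branch class token queries the Pest-branch patch tokens.
        fused_roi_cls, _ = self.attn_roi(roi_cls, pest_patches, pest_patches)

        # Residual update of each class token with the complementary information.
        pest_out = torch.cat([pest_cls + fused_pest_cls, pest_patches], dim=1)
        roi_out = torch.cat([roi_cls + fused_roi_cls, roi_patches], dim=1)
        return pest_out, roi_out

# Example: two branches with 196 patch tokens each at one scale.
fusion = CrossBranchFusion()
pest = torch.randn(2, 197, 256)
roi = torch.randn(2, 197, 256)
p, r = fusion(pest, roi)
print(p.shape, r.shape)  # torch.Size([2, 197, 256]) twice
```

In this sketch the exchanged class tokens carry complementary information back into their own branches; per the abstract, ROI-ViT applies such fusion within a multiscale hierarchy.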
Related papers
- Benchmarking Image Transformers for Prostate Cancer Detection from Ultrasound Data [3.8208601340697386]
Deep learning methods for classifying prostate cancer (PCa) in ultrasound images typically employ convolutional neural networks (CNNs) to detect cancer in small regions of interest (ROIs) along a needle trace region.
Multi-scale approaches have sought to mitigate this issue by combining the context awareness of transformers with a CNN feature extractor to detect cancer from multiple ROIs using multiple-instance learning (MIL).
We present a study of several image transformer architectures for both ROI-scale and multi-scale classification, and a comparison of the performance of CNNs and transformers for ultrasound-based prostate cancer classification.
arXiv Detail & Related papers (2024-03-27T03:39:57Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - ClusVPR: Efficient Visual Place Recognition with Clustering-based
Weighted Transformer [13.0858576267115]
We present ClusVPR, a novel approach that tackles the specific issues of redundant information in duplicate regions and representations of small objects.
ClusVPR introduces a unique paradigm called the Clustering-based Weighted Transformer Network (CWTNet).
We also introduce the optimized-VLAD layer that significantly reduces the number of parameters and enhances model efficiency.
arXiv Detail & Related papers (2023-10-06T09:01:15Z) - ROI-based Deep Image Compression with Swin Transformers [14.044999439481511]
Encoding the Region Of Interest (ROI) with better quality than the background has many applications, including video conferencing systems.
We propose an ROI-based image compression framework with Swin transformers as the main building blocks of the autoencoder network.
arXiv Detail & Related papers (2023-05-12T22:05:44Z) - Hierarchical Transformer for Survival Prediction Using Multimodality
Whole Slide Images and Genomics [63.76637479503006]
Learning good representations of giga-pixel whole-slide pathology images (WSIs) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z) - Hierarchical Similarity Learning for Aliasing Suppression Image
Super-Resolution [64.15915577164894]
A hierarchical image super-resolution network (HSRNet) is proposed to suppress the influence of aliasing.
HSRNet achieves better quantitative and visual performance than other works and suppresses aliasing more effectively.
arXiv Detail & Related papers (2022-06-07T14:55:32Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - HUMUS-Net: Hybrid unrolled multi-scale network architecture for
accelerated MRI reconstruction [38.0542877099235]
HUMUS-Net is a hybrid architecture that combines the beneficial implicit bias and efficiency of convolutions with the power of Transformer blocks in an unrolled and multi-scale network.
Our network establishes a new state of the art on the largest publicly available MRI dataset, the fastMRI dataset.
arXiv Detail & Related papers (2022-03-15T19:26:29Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic Inductive Bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - MOGAN: Morphologic-structure-aware Generative Learning from a Single
Image [59.59698650663925]
Recently proposed generative models can be trained on only a single image.
We introduce a MOrphologic-structure-aware Generative Adversarial Network named MOGAN that produces random samples with diverse appearances.
Our approach focuses on internal features, including the maintenance of rational structures and variation in appearance.
arXiv Detail & Related papers (2021-03-04T12:45:23Z)